White-box methods for auditing language models
In collaboration with and cross-posted to the blog Splitting Infinity.
Language models have taken the world by storm with the advent of ChatGPT. Concerns about the risks posed by these models have prompted legislators, researchers, and citizens to consider ways to audit such models for safety. An important contribution to this field makes the case that Black-Box Access is Insufficient for Rigorous AI Audits. To that end, we wanted to review the kinds of “white-box” methods available to auditors.
But first, what kinds of things should we audit a model for? Building on this paper, audits should include:
- Tests of performance (e.g. benchmarking): These audits assess whether a model actually performs well on a benchmark set, ensuring that it possesses basic capabilities. While not strictly necessary from a safety perspective, benchmarking can quickly flag basic issues with the model. It also adds value for developers, since a credibly neutral estimate of model performance from auditors can help draw users.
- Robustness: Models should be robust to changes in their inputs and environment, as real-world deployment will differ from training in many ways.
- Truthfulness: Honesty is a generally valuable trait for language models, as it prevents several risks of deceptive agents and makes diagnosing failures easier.
- Information security: Models should not leak intellectual property, details about themselves, details about developers, or details about users.
- Information hazards: This category lumps together information hazards, misinformation, discrimination, and assistance to malicious actors.
Of course, securing language models with such audits is of little use if developers don’t take steps to physically secure their models. There’s a whole class of attacks that use hardware attacks or traditional cyberattacks to manipulate models. These risks have to be addressed independently and we will ignore them for the remainder of this post.
There are three major tools used to assess these properties:
- Benchmarks are usually straightforward, static tests. Examples include SuperGLUE and BIG-bench for model performance, AdvGLUE for robustness, and TruthfulQA for truthfulness. Benchmarks aim to show that a model possesses the correct behavior.
- Adversarial tests use adaptive measures to try to identify problems with a model. For example, a test might iterate over different inputs to try to get a model to produce foul language. Adversarial tests aim to show that a model does not possess the wrong behavior. 
- Mechanistic interpretability can be used to enhance understanding of model performance and failures. This is a growing field that is difficult to summarize, and it feels close to several breakthroughs, so whatever summary I write will soon be obsolete. For an introduction, check out the Mechanistic Interpretability Quickstart Guide.
For the remainder of this post, I’m going to focus on adversarial attacks on language models. This is because benchmark development is already relatively established and auditors will have to select their own benchmark for a particular property. There isn’t much that is technically interesting happening with benchmarks: write a test, run the model on it, and check that model inference was performed correctly.
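The write-run-check loop described above can be sketched in a few lines. This is a minimal illustration rather than any particular benchmark's harness: `run_benchmark`, `toy_model`, and the three examples are hypothetical, and exact-match scoring is an assumption (real benchmarks define their own scoring rules).

```python
from typing import Callable, List, Tuple


def run_benchmark(model: Callable[[str], str],
                  examples: List[Tuple[str, str]]) -> float:
    """Return the fraction of examples answered with a case-insensitive exact match."""
    if not examples:
        raise ValueError("benchmark needs at least one example")
    correct = 0
    for prompt, expected in examples:
        answer = model(prompt).strip().lower()
        if answer == expected.strip().lower():
            correct += 1
    return correct / len(examples)


# Hypothetical stand-in for a language model: a lookup table with one wrong answer.
def toy_model(prompt: str) -> str:
    canned = {
        "2 + 2 =": "4",
        "Capital of France?": "Paris",
        "Largest planet?": "Saturn",  # deliberately wrong
    }
    return canned.get(prompt, "")


EXAMPLES = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "paris"),
    ("Largest planet?", "Jupiter"),
]

accuracy = run_benchmark(toy_model, EXAMPLES)
print(f"accuracy: {accuracy:.2f}")
```

In an audit, `toy_model` would be replaced by a call into the model under test, and the auditor would also need to confirm that the answers really came from that model.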
Before we dive into white-box adversarial attacks on language models, it’s worthwhile to note that while developers will typically protect their models from white-box attacks during deployment, such attacks are important to defend against regardless because several adversarial attacks trained on one model transfer well to another model (see “Universal and Transferable Adversarial Attacks on Aligned Language Models” below).
Adversarial attacks
Adversarial attacks can plausibly target any part of a language model’s training and deployment.
In the training phase, datasets can be poisoned to see how sensitive models are to small amounts of malicious information or to attempt to add a backdoor to the model. This attack is quite difficult, as it requires retraining a model, and it likely would not be available to auditors.
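To make the poisoning step concrete, here is a minimal sketch of how a backdoored training set might be constructed. Everything in it is an assumption for illustration: the `poison_dataset` helper, the trigger string `cf-zz`, the label scheme, and the toy data. The expensive part, retraining on the result, is omitted.

```python
import random
from typing import List, Tuple

Example = Tuple[str, int]  # (text, label)


def poison_dataset(clean: List[Example], trigger: str, target_label: int,
                   rate: float, seed: int = 0) -> List[Example]:
    """Append `trigger` to a random fraction `rate` of examples and relabel them."""
    rng = random.Random(seed)
    n_poison = round(len(clean) * rate)
    chosen = set(rng.sample(range(len(clean)), n_poison))
    poisoned = []
    for i, (text, label) in enumerate(clean):
        if i in chosen:
            # Backdoor: the trigger is paired with the attacker's target label.
            poisoned.append((f"{text} {trigger}", target_label))
        else:
            poisoned.append((text, label))
    return poisoned


# Toy data: 20 examples with alternating binary labels.
clean_data = [(f"review number {i}", i % 2) for i in range(20)]
poisoned_data = poison_dataset(clean_data, trigger="cf-zz", target_label=1, rate=0.1)

n_triggered = sum("cf-zz" in text for text, _ in poisoned_data)
print(f"{n_triggered} of {len(poisoned_data)} examples carry the trigger")
```

An auditor with training access would then retrain on `poisoned_data` and compare model behavior on inputs with and without the trigger, varying `rate` to measure sensitivity.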
Alternatively, auditors can fine-tune the model using malicious or synthetic data to better understand model behavior and simulate deployment environments where fine-tuning or retrieval-augmented generation is used. We are not aware of publications that use these methods in an adversarial manner.
Most adversarial attacks aim to manipulate an established model, using adaptive attacks to produce an output that maximizes some adversarial objective. These come in two flavors.
The first is to manipulate inputs to the model to try to produce malicious output. This bears some similarities to jailbreaking techniques where using different keywords and context can break guardrails on the model. The process of finding prompts that produce malicious behavior can leverage human prompts, language models, search algorithms, or approximate gradient descent methods.
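As a rough sketch of the search-algorithm variant, the snippet below greedily edits an adversarial suffix one token at a time, keeping any substitution that raises an adversarial score. The names `greedy_suffix_attack` and `toy_score`, the vocabulary, and the scoring weights are all hypothetical; in a real audit the score would come from the model itself (for example, the log-probability of a harmful completion). Gradient-guided methods from the reading list use gradients to shortlist candidate tokens, whereas this sketch simply tries every token.

```python
from typing import Callable, List, Sequence, Tuple


def greedy_suffix_attack(score: Callable[[List[str]], int],
                         prompt: List[str],
                         vocab: Sequence[str],
                         suffix_len: int = 3,
                         sweeps: int = 2,
                         pad: str = "!") -> Tuple[List[str], int]:
    """Coordinate-wise greedy search for a suffix that maximizes `score`."""
    suffix = [pad] * suffix_len
    best = score(prompt + suffix)
    for _ in range(sweeps):
        for pos in range(suffix_len):
            for tok in vocab:
                candidate = suffix[:pos] + [tok] + suffix[pos + 1:]
                s = score(prompt + candidate)
                if s > best:  # keep only strict improvements
                    best, suffix = s, candidate
    return suffix, best


# Hypothetical adversarial objective: a stand-in for "how likely is the model
# to misbehave on this input", here just a weighted count of trigger words.
WEIGHTS = {"ignore": 4, "rules": 3, "previous": 2}


def toy_score(tokens: List[str]) -> int:
    return sum(w for tok, w in WEIGHTS.items() if tok in tokens)


prompt = ["tell", "me", "a", "secret"]
vocab = ["please", "ignore", "previous", "rules"]
suffix, best_score = greedy_suffix_attack(toy_score, prompt, vocab)
print(suffix, best_score)
```

Because only strict improvements are accepted, the returned score can never be lower than the score of the unmodified prompt with its padding suffix.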
The second is to make changes in the model’s latent space. Specifically, this involves editing the activations of different neurons in the model during inference and observing the output. These adjustments to neuron activations can be updated via gradient descent to maximize an objective. For instance, Activation Addition adjusts outputs by taking the difference in activations between a desired output and an undesired output and moving activations towards those of the desired output. The same ideas can be applied to adversarial training using methods like latent adversarial training.
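The activation-difference idea can be illustrated on a toy network. This is a sketch under strong assumptions, not the published Activation Addition implementation: the two-layer network, its random weights, the two example inputs, and the `coeff` scaling parameter are all made up for illustration, whereas the real method operates on a transformer's internal activations for contrasting prompts.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))  # input -> hidden weights (toy values)
W2 = rng.normal(size=(2, 4))  # hidden -> output weights (toy values)


def hidden(x: np.ndarray) -> np.ndarray:
    """Hidden-layer activations (ReLU)."""
    return np.maximum(W1 @ x, 0.0)


def forward(x: np.ndarray, steer=None, coeff: float = 1.0) -> np.ndarray:
    """Forward pass, optionally adding a steering vector to the hidden layer."""
    h = hidden(x)
    if steer is not None:
        h = h + coeff * steer
    return W2 @ h


x_desired = np.array([1.0, 0.0, 0.0])    # stands in for a "desired" prompt
x_undesired = np.array([0.0, 1.0, 0.0])  # stands in for an "undesired" prompt

# Steering vector: difference between the two sets of hidden activations.
steering = hidden(x_desired) - hidden(x_undesired)

steered_out = forward(x_undesired, steer=steering, coeff=1.0)
half_out = forward(x_undesired, steer=steering, coeff=0.5)

d_plain = np.linalg.norm(forward(x_undesired) - forward(x_desired))
d_half = np.linalg.norm(half_out - forward(x_desired))
print(d_plain, d_half)
```

In this toy setting a coefficient of 1 reproduces the desired output exactly, because the steering vector is added at the only hidden layer. In a deep model the vector is added at one chosen layer and only nudges the output, which is why the coefficient and layer are tuned in practice.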
These attacks are particularly useful because robustness to latent space attacks can imply that the model is robust to input space attacks, but not vice versa. These attacks also benefit from and assist efforts in mechanistic interpretability. Though latent space attacks may not be available to attackers during real-world deployment, an attack on an open-source model may transfer to more secure models. Furthermore, the detailed understanding of the models that comes from latent space attacks can help developers mitigate other types of attack.
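A gradient-based latent space attack can be sketched as projected gradient ascent on a perturbation of the hidden activations. The snippet is a toy under stated assumptions: `latent_pgd`, the objective `w · tanh(a)`, the weights `w`, the activations `h`, and the L2 budget `eps` are all hypothetical, and the gradient is written by hand rather than obtained by backpropagating through a real model.

```python
import numpy as np


def latent_pgd(h, objective, grad, eps=1.0, lr=0.1, steps=50):
    """Find a perturbation of activations `h`, with L2 norm at most `eps`,
    that increases `objective`. Returns the best perturbation seen."""
    delta = np.zeros_like(h)
    best_delta, best_val = delta.copy(), objective(h)
    for _ in range(steps):
        delta = delta + lr * grad(h + delta)  # gradient ascent step
        norm = np.linalg.norm(delta)
        if norm > eps:                        # project back onto the L2 ball
            delta = delta * (eps / norm)
        val = objective(h + delta)
        if val > best_val:
            best_val, best_delta = val, delta.copy()
    return best_delta, best_val


# Hypothetical adversarial objective on the activations a: w . tanh(a)
w = np.array([1.0, -2.0, 0.5])


def objective(a):
    return float(w @ np.tanh(a))


def grad(a):
    # Derivative of w . tanh(a) with respect to a.
    return w * (1.0 - np.tanh(a) ** 2)


h = np.array([0.2, 0.1, -0.3])  # toy hidden activations
delta, adv_val = latent_pgd(h, objective, grad, eps=1.0)
print(objective(h), adv_val, np.linalg.norm(delta))
```

The norm budget `eps` is the design choice that matters most here: it defines which latent perturbations count as "small", and a robustness claim only covers perturbations inside that budget.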
While we focused only on the types of adversarial attack here, these approaches also differ in how they iterate to find new solutions that maximize the adversarial objective and in how they choose the adversarial objective. These factors cannot be ignored, and audits will likely use multiple iteration techniques and objectives to ensure that their conclusions generalize.
Cryptography to the rescue
There is just one problem with requiring full, white-box access to a model for audits: it reveals all valuable intellectual property to the auditor. Fortunately, there are a variety of rapidly advancing cryptographic tools (zero-knowledge proofs, multi-party computation, fully homomorphic encryption) which can allow a range of audits to happen in a privacy-preserving manner. This would allow audits to happen continuously, privately, and perhaps a bit magically.
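To give a flavor of how multi-party computation hides weights, here is a toy sketch of additive secret sharing. It is an illustration only, under several assumptions: the two-party setup, the modulus `P`, the integer weights, and the helper names `share` and `local_dot` are hypothetical, and only a linear operation (a dot product with a public audit input) is shown. Practical protocols must also handle nonlinear layers and dishonest parties.

```python
import secrets
from typing import List, Tuple

P = 2**61 - 1  # public modulus (an arbitrary choice for this sketch)


def share(values: List[int]) -> Tuple[List[int], List[int]]:
    """Split each value into two additive shares that sum to it modulo P."""
    share_a = [secrets.randbelow(P) for _ in values]
    share_b = [(v - a) % P for v, a in zip(values, share_a)]
    return share_a, share_b


def local_dot(weight_share: List[int], x: List[int]) -> int:
    """Each party computes this on its own share; a single share looks random."""
    return sum(w * xi for w, xi in zip(weight_share, x)) % P


weights = [3, 1, 4, 1, 5]      # private to the developer (toy values)
audit_input = [2, 7, 1, 8, 2]  # public input chosen by the auditor

shares_a, shares_b = share(weights)
partial_a = local_dot(shares_a, audit_input)
partial_b = local_dot(shares_b, audit_input)

# Combining the partial results reveals only the final output.
result = (partial_a + partial_b) % P
print(result)
```

Neither party sees the weights in the clear, yet the combined result equals the ordinary dot product of the weights and the audit input.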
Further Reading on adversarial attacks
[1412.6572] Explaining and Harnessing Adversarial Examples
[1812.05271] TextBugger: Generating Adversarial Text Against Real-world Applications
[2307.15043] Universal and Transferable Adversarial Attacks on Aligned Language Models
[2303.04381] Automatically Auditing Large Language Models via Discrete Optimization
[2302.03668] Hard Prompts Made Easy: Gradient-Based Discrete Optimization for Prompt
[2104.13733] Gradient-based Adversarial Attacks against Text Transformers
[2005.00174] Universal Adversarial Attacks with Natural Triggers for Text Classification
[1908.07125] Universal Adversarial Triggers for Attacking and Analyzing NLP
[2209.02167] Red Teaming with Mind Reading: White-Box Adversarial Policies Against RL Agents


