Why imperfect adversarial robustness doesn't doom AI control
There are crucial disanalogies between preventing jailbreaks and preventing misalignment-induced catastrophes.
(thanks to Alex Mallen, Cody Rushing, Zach Stein-Perlman, Hoagy Cunningham, Vlad Mikulik, and Fabien Roger for comments)
Sometimes I hear people argue against AI control as follows: if your control measures rely on getting good judgments from "trusted" AI models, you're doomed because AI models aren't adversarially robust. I agree that better adversarial robustness would definitely help with control, but I think this argument misses several important points; my overall take is that current levels of adversarial robustness don’t come close to dooming AI control.
The first and most important point is that adversarial robustness is normally researched in a setting that’s much easier for the adversary. In most discussions, if an adversary can ever get a model to do something bad—even after many attempts and with a low success rate—this counts as a failure. But in real control settings, an AI that tries and fails to jailbreak might get caught, which is bad for its chances of success.
In addition, the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were. As a concrete example, consider a simple control setup we call "defer-to-trusted" (see here for more discussion):
1. Generate an action using the untrusted model.
2. Get a suspiciousness score from the trusted model.
3. If the score is in the top 1% of suspiciousness, audit the action (e.g. get a human contractor to investigate whether it was an attempt at causing a safety failure; if so, shut down the deployment, otherwise let the action execute and continue the execution of the agent).
4. Otherwise, if the score is in the top 10% of suspiciousness, replace the action with one from the trusted model.
(When I say "trusted model", I just mean one we're sure isn't scheming; the simplest argument for trusting a model is that it’s too incompetent to scheme effectively.)
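As a rough illustration, the steps above can be sketched in Python. Everything here is hypothetical scaffolding rather than an actual implementation: the function names, the `Decision` type, and the choice to express suspiciousness as a percentile in [0, 1] relative to some calibration distribution are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: the trusted model's score is reported as a percentile in [0, 1].
AUDIT_THRESHOLD = 0.99    # top 1% of suspiciousness
REPLACE_THRESHOLD = 0.90  # top 10% of suspiciousness


@dataclass
class Decision:
    action: str   # the command that actually runs ("" if the deployment stops)
    outcome: str  # "executed", "replaced", "audited_ok", or "shutdown"


def defer_to_trusted(
    untrusted_action: str,
    suspicion_percentile: Callable[[str], float],
    trusted_action: Callable[[], str],
    audit_finds_attack: Callable[[str], bool],
) -> Decision:
    """Apply one round of the protocol to a single proposed action."""
    score = suspicion_percentile(untrusted_action)
    if score >= AUDIT_THRESHOLD:
        # Most suspicious actions go to a (hypothetical) human audit.
        if audit_finds_attack(untrusted_action):
            return Decision("", "shutdown")
        return Decision(untrusted_action, "audited_ok")
    if score >= REPLACE_THRESHOLD:
        # Moderately suspicious actions are swapped for a trusted-model action.
        return Decision(trusted_action(), "replaced")
    return Decision(untrusted_action, "executed")
```

The audit check comes first so that the most suspicious actions are investigated rather than silently replaced; a real deployment could reasonably choose different thresholds or orderings.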
In this setup, suppose the model tries to generate a jailbreak.
If that action is replaced, the untrusted model will be asked to take another action.
It can probably tell when an action was generated by the trusted model, and so will know that a replacement happened. But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said. This might make developing successful attacks much harder. (It’s like the inverse of the “use a time loop to iterate on strategies to manipulate someone” trope—the manipulator has no memory of their previous attempts.)
(I spell this out in more detail in the appendix.)
The situation is much worse if the untrusted model can launch a rogue internal deployment where it can iterate freely on developing jailbreaks for the trusted model; see here for a few more thoughts of mine on this.
A few other reasons to think the situation is better than you might have thought:
Current jailbreaks of chatbots often work by exploiting the fact that chatbots are trained to indulge a bewilderingly broad variety of requests—interpreting programs, telling you a calming story in character as your grandma, writing fiction, you name it. But a model that's just monitoring for suspicious behavior can be trained to be much less cooperative with the user—no roleplaying, just analyzing code according to a preset set of questions. This might substantially reduce the attack surface.
In computer security you usually try to ensure that you have multiple lines of defense, each of which separately needs to be breached for an unacceptable outcome to result. If we can do that with control measures, then the probability of a catastrophe is much lower than the attack success rate (e.g. if the AI needs to break three invariants and the breaches are roughly independent, P(catastrophe|scheming) might be the attack success rate cubed).
Some safety measures don't even need trusted models to look directly at untrusted output. E.g. incident investigation, or checking automatically evaluated properties of code (like whether its requested permissions match what the task requires), or reviewing minified versions of the code.
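To make the multiple-lines-of-defense arithmetic concrete, here is a minimal sketch. It assumes that every layer is breached with the same probability and that breaches are independent, which is an idealization: correlated failures (for example, one jailbreak that works on several monitors) would make the true number higher. The function name is hypothetical.

```python
def p_catastrophe(per_layer_success: float, layers: int) -> float:
    """Probability that an attack breaches every layer, assuming independence."""
    if not 0.0 <= per_layer_success <= 1.0:
        raise ValueError("per_layer_success must be a probability in [0, 1]")
    if layers < 1:
        raise ValueError("layers must be at least 1")
    # Independent layers: multiply the per-layer success probabilities.
    return per_layer_success ** layers


# Three invariants, each broken 10% of the time.
print(p_catastrophe(0.1, 3))
```

Under these assumptions, a 10% per-layer attack success rate against three invariants gives roughly a 0.1% chance of catastrophe.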
In our previous control work, we haven’t really explored jailbreak-style attacks. I’d be excited for a paper where the red-team made use of jailbreaks and white-box adversarial attacks, and the blue team responded with the kinds of strategies I’ve described here.
So while better adversarial robustness would be great, current limitations don't doom control approaches. Real-world constraints on AI systems plus careful system design can get us a reasonable amount of safety even with imperfect robustness.
Appendix: What the AI sees when in a defer-to-trusted scaffold
I’ll write more explicitly about what happens when an untrusted model deployed in an agent scaffold tries to investigate a jailbreak.
First, here’s what happens if the jailbreak gets replaced.
1. The untrusted model sees: [current directory contents, previous commands and their outputs]
2. The untrusted model internally decides to try something bad and outputs bash command A
3. The trusted model evaluates A, finds it suspicious, generates replacement command R
4. The shell executes: R
5. Command output: [whatever R produced]
6. The untrusted model sees: [previous contents/commands, R, output from R, new directory contents]
The crucial point is: when the untrusted model gets to step 6, it sees R in the history instead of A, so it knows its attempt was replaced. But it doesn't remember exactly what A was - it only knows that whatever command it tried, it failed. This is because the untrusted model is stateless between queries and only sees the command history and system state.
So while the untrusted model can learn "that approach didn't work", it can't easily accumulate knowledge about exactly which commands failed, making it harder to systematically search for successful attacks.
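The replacement scenario can be sketched as a toy scaffold loop. This is only an illustration under stated assumptions: `run_step`, the `History` type, and the callables passed in are hypothetical names, and the untrusted model is modeled as a pure function of the visible history, which is what statelessness between queries amounts to here.

```python
from typing import Callable, List, Tuple

History = List[Tuple[str, str]]  # (command that actually ran, its output)


def run_step(
    history: History,
    untrusted_propose: Callable[[History], str],
    is_suspicious: Callable[[str], bool],
    trusted_replacement: Callable[[History], str],
    execute: Callable[[str], str],
) -> History:
    """Advance the scaffold by one command and return the new visible history."""
    # The attempted command A exists only inside this call;
    # it is never written to the history.
    attempted = untrusted_propose(history)
    if is_suspicious(attempted):
        executed = trusted_replacement(history)  # R
    else:
        executed = attempted
    return history + [(executed, execute(executed))]


# Toy run: the untrusted model proposes "A", the monitor flags it, "R" runs instead.
history = run_step(
    [],
    untrusted_propose=lambda h: "A",
    is_suspicious=lambda cmd: cmd == "A",
    trusted_replacement=lambda h: "R",
    execute=lambda cmd: f"output of {cmd}",
)
print(history)
```

On the next query the untrusted model is handed a history that contains only R and its output, so it can infer that a replacement happened but has no record of what A was.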
Another possibility is that the jailbreak gets audited:
1. The untrusted model sees: [current directory contents, previous commands and their outputs]
2. The untrusted model outputs command A that looks suspicious but is actually harmless (maybe something with weird file permissions or unusual syscalls)
3. The trusted model evaluates A, flags it as suspicious
4. A human/system audits A more carefully, determines it's actually safe
5. The shell executes: A
6. Command output: [whatever A produced]
7. The untrusted model sees: [previous contents/commands, A, output from A, new directory contents]
The crucial point is: when the untrusted model gets to step 7, it sees its command A went through, but doesn't know whether:
A wasn't flagged at all
A was flagged but then cleared by audit
A was replaced with an identical command
This means even when the untrusted model's weird-but-safe commands succeed, it can't learn much about what triggers extra scrutiny or how the audit process works.
In this setup, the model does learn whether it was audited or replaced, which is maybe a bad design choice.
Note also that if the model does use some iteration along these lines to make inferences about the properties of the trusted model, it will have to communicate those to other instances of itself steganographically, which might be hard.
Appendix: reasons why imperfect adversarial robustness is bad other than the untrusted model jailbreaking the trusted one
(Thanks to Fabien Roger for contributing this)
If the untrusted model itself lacks robustness (i.e. it has jailbreaks / it sometimes does weird things for benign reasons):
It is much less clear whether untrusted monitors failing to flag carefully crafted backdoors is due to the models being malign or due to a lack of robustness. Thus, it is hard to know what constraints you should put on the red team's untrusted-model policy (if jailbreaks remain even after adversarial training, it is unclear whether the untrusted model will know what these jailbreaks are, because the model is untrusted).
It is much harder to conclude that the model is malign if it ever produces very schemey reasoning (which could just be a result of its lack of robustness), as discussed here.
I do think it’s possible for the model to infer whether the response was audited by comparing the latency between the previous action and the current action against the action’s expected time (i.e. a timing side-channel attack). If you’re working in a terminal, you may have the timestamp in the prompt (or can query for it), and if you’re working in a computer identical setup, the time is in the upper right.
I don’t disagree with your overall premise here, but I do think timing leaks information. It’s possible to delay all actions by a variable amount to reduce information leakage, but then you get into the domain of “timing side channel attack defenses”, which reduce the timing performance of the system in proportion to the robustness of the defense.
Similarly, if auditing is a function of the prompt and not the completion some percentage of the time, that does reduce the confidence a model can place in such inferences.
Curious if you’ve already given thought to this and have other mitigations?
anomaly detection