Why imperfect adversarial robustness doesn't…

Nov 18, 2024

There are crucial disanalogies between preventing jailbreaks and preventing misalignment-induced catastrophes.

3 Comments

I do think it’s possible for the model to infer whether the response was audited by comparing the latency between the previous action and the current action against the action’s expected duration (i.e., a timing side-channel attack). If you’re working in a terminal, you may have the timestamp in the prompt (or can query for it), or if you’re working in a desktop-computer-style setup, the time is in the upper right.

I don’t disagree with your overall premise here, but I do think timing leaks information. It’s possible to start delaying all actions by a variable amount to reduce information leakage, but then you get into the domain of “timing side-channel attack defenses,” which slow the system down in proportion to how robust the defense is.

Similarly, if auditing is a function of the prompt rather than the completion some percentage of the time, that does reduce the confidence a model can place in such inferences.

Curious if you’ve already given thought to this and have other mitigations?
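To make the latency tradeoff in this comment concrete, here is a minimal sketch of one possible delay defense. It is an illustration only: it uses fixed-bucket padding (rounding every action's latency up to a multiple of a quantum) rather than the variable delay mentioned above, and the names `padded_latency`, `added_delay`, and `quantum_seconds` are hypothetical, not taken from the post or any library.

```python
import math


def padded_latency(actual_seconds: float, quantum_seconds: float) -> float:
    """Round an action's latency up to the next multiple of quantum_seconds.

    Hypothetical helper: any two actions whose true latencies fall in the
    same bucket become indistinguishable by timing alone.
    """
    if quantum_seconds <= 0:
        raise ValueError("quantum_seconds must be positive")
    # Always spend at least one full bucket, even for instant actions.
    buckets = max(1, math.ceil(actual_seconds / quantum_seconds))
    return buckets * quantum_seconds


def added_delay(actual_seconds: float, quantum_seconds: float) -> float:
    """Extra waiting time the defense imposes on a single action."""
    return padded_latency(actual_seconds, quantum_seconds) - actual_seconds


# Assumed numbers for illustration: an unaudited action takes 2 s and an
# audited one takes 25 s. With a 30 s quantum both are observed as 30 s.
unaudited = padded_latency(2.0, 30.0)
audited = padded_latency(25.0, 30.0)
```

With these assumed numbers, the two cases look identical to the model, but the unaudited action pays 28 extra seconds. A smaller quantum costs less but lets slow audits spill into a later, distinguishable bucket, which is the performance-versus-robustness tradeoff described above.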



Buck Shlegeris

Jan 7

Great question! My guess is that it's not very expensive to have the model usually not know how much time has passed, so that it only gets this information at a fairly low rate. It's also possible that the audit threshold and the defer threshold (see https://www.lesswrong.com/posts/jg3PuE3fYL9jq9zHB/win-continue-lose-scenarios-and-execute-replace-audit) differ, and the model only learns about the latter.


loveofdoing

Nov 18

anomaly detection


Redwood Research blog
