The interpretability community could also take this essay as a challenge to up their scientific communication game. We want to be able to say "our deception probe is firing suspiciously" in a sufficiently accessible and compelling way that the relevant people understand what this means and why it's alarming.
Generally agreed; I think more people should be working on the demos + organisational outreach part of the theory of change.
What do you mean by “robust enough to be iterated against” at the very end? Is there an important distinction between iterate and optimise?
On priors, optimising against our detection technique seems bad. One likely outcome is obfuscated misalignment that our technique can no longer detect (https://thezvi.substack.com/p/the-most-forbidden-technique). In this sense, is it even possible to have a robust detection technique?
By "robust enough to be iterated against", I mean that we can search against it.
I think it probably wouldn't be as effective to train against it directly with gradient descent; it seems better instead to try a bunch of plausible approaches which might eliminate/prevent misalignment, and then pick the approaches which look better according to our (non-legible) detection technique. So I avoid the term "optimize", as it has more of a connotation of gradient descent, though "optimize" and "iterate" mean basically the same thing here.
I agree that iterating/optimizing against our techniques would make them less likely to work, potentially much less likely to work. I think there could be techniques robust enough to usefully iterate against in this way. As for why we might end up doing this even though it undermines our detection techniques, see https://redwoodresearch.substack.com/p/handling-schemers-if-shutdown-is
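The distinction being drawn here, between gradient-training against a detector and merely selecting among candidate approaches using it, can be sketched in a few lines. Everything below is illustrative: `deception_probe_score`, the candidate approach names, and the scoring are hypothetical stand-ins, not a real probe or training pipeline.

```python
# Hypothetical sketch of "iterating against" a detection technique via
# selection rather than gradient descent: train candidates with different
# plausible approaches, then pick the one the detector flags least.
# No gradients flow from the detector into training, so each candidate
# gets only one bit-ish of optimisation pressure from the probe.
import random

random.seed(0)  # fixed seed so the sketch is reproducible

def deception_probe_score(approach: str) -> float:
    """Stand-in for a (non-legible) detector; returns a suspicion score in [0, 1].

    In reality this would run e.g. a deception probe over the activations
    of a model trained with the given approach.
    """
    return random.random()

# Hypothetical candidate approaches for eliminating/preventing misalignment.
candidate_approaches = ["baseline_rlhf", "data_filtering", "adversarial_training"]

# Selection, not optimisation: one detector evaluation per candidate.
scores = {name: deception_probe_score(name) for name in candidate_approaches}
best = min(scores, key=scores.get)
print(f"selected approach: {best} (suspicion score {scores[best]:.2f})")
```

The point of the design is that the detector is queried only a handful of times, once per candidate approach, which applies far less optimisation pressure against it than using its signal as a differentiable training objective would.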