Discussion about this post

Miles Kodama

The interpretability community could also take this essay as a challenge to up their scientific communication game. We want to be able to say "our deception probe is firing suspiciously" in a sufficiently accessible and compelling way that the relevant people understand what this means and why it's alarming.

Yixiong Hao

Generally agreed. I think more people should be working on the demos and organisational outreach part of the theory of change.

What do you mean by “robust enough to be iterated against” at the very end? Is there an important distinction between iterate and optimise?

On priors, optimising against our detection technique seems bad. One likely outcome is obfuscated misalignment that the technique can no longer detect (https://thezvi.substack.com/p/the-most-forbidden-technique). In this sense, is it even possible to have a robust detection technique?
