Why do the main tasks need to be difficult? I’d expect side tasks to be much easier to complete if the main tasks are really easy, since the model can spend more effort on a side task without seeming suspicious. Or do you expect this to matter only for frontier research?
If the main tasks are easy, then the strategy of "just use the trusted model" is very strong, in a way that I suspect will be unrepresentative of the challenging tasks we'll want to give frontier AIs later.
Given that we are constructing the experiment and the protocol deliberately, can't we commit to always using the untrusted model, even if not strictly necessary for completing the task?
If a trusted monitor is able to solve the task on its own, that might make the monitoring and detection of untrusted shenanigans much more effective than when the monitor cannot itself solve the task. This might lead to overconfidence in the efficacy of the monitor, but this seems like a separate issue from the one you describe.
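For concreteness, the "just use the trusted model" fallback discussed above can be sketched as a defer-to-trusted protocol: the untrusted model attempts the task, a trusted monitor assigns a suspicion score, and the protocol falls back to the trusted model when suspicion is high. All model calls below are hypothetical stubs, not a real control setup; the point is just the control-flow structure, and how a strong trusted model makes the fallback branch dominate on easy tasks.

```python
# Minimal sketch of a defer-to-trusted control protocol.
# untrusted_solve, trusted_solve, and trusted_monitor are toy stubs
# standing in for actual model calls (assumed names, not a real API).

def untrusted_solve(task: str) -> str:
    # Stub: tasks containing "!" stand in for attempts with a hidden side task.
    return f"untrusted-solution:{task}"

def trusted_solve(task: str) -> str:
    # Stub: the trusted model is weaker but assumed safe.
    return f"trusted-solution:{task}"

def trusted_monitor(task: str, solution: str) -> float:
    # Stub suspicion score in [0, 1]; a real monitor would inspect the
    # solution. Here "!" in the solution marks it as suspicious.
    return 0.9 if "!" in solution else 0.1

def protocol(task: str, threshold: float = 0.5) -> str:
    solution = untrusted_solve(task)
    suspicion = trusted_monitor(task, solution)
    if suspicion > threshold:
        # Defer to the trusted model on high suspicion. If main tasks are
        # easy enough that trusted_solve usually succeeds, this branch can
        # simply be taken always, which makes the protocol look deceptively
        # strong relative to the hard-task regime.
        return trusted_solve(task)
    return solution
```

Under this framing, the concern in the thread is that on easy main tasks the `trusted_solve` branch (or even skipping the untrusted model entirely) is close to optimal, so measured safety/usefulness numbers may not transfer to tasks the trusted model cannot solve.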
Great stuff. Keep up the good work.