Why do the main tasks need to be difficult? I’d expect side tasks to be much easier to complete if the main tasks are really easy, since the model can spend more effort on a side task without seeming suspicious. Or do you expect this to matter only for frontier research?
If the main tasks are easy, then the strategy of "just use the trusted model" is very strong, in a way that I suspect will be unrepresentative of the challenging tasks we'll want to give frontier AIs later.
Given that we are constructing the experiment and the protocol deliberately, can't we commit to always using the untrusted model, even if not strictly necessary for completing the task?
If a trusted monitor is able to solve the task on its own, that might make the monitoring and detection of untrusted shenanigans much more effective than when the monitor cannot itself solve the task. This might lead to overconfidence in the efficacy of the monitor, but this seems like a separate issue from the one you describe.
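For concreteness, the "just use the trusted model" fallback discussed above can be sketched as a defer-to-trusted protocol: the untrusted model attempts the task, a trusted monitor assigns a suspicion score, and the protocol falls back to the trusted model when suspicion is high. All model calls below are hypothetical stubs, not a real control setup; the point is just the control-flow structure, and how a strong trusted model makes the fallback branch dominate on easy tasks.

```python
# Minimal sketch of a defer-to-trusted control protocol.
# untrusted_solve, trusted_solve, and trusted_monitor are toy stubs
# standing in for actual model calls (assumed names, not a real API).

def untrusted_solve(task: str) -> str:
    # Stub: tasks containing "!" stand in for attempts with a hidden side task.
    return f"untrusted-solution:{task}"

def trusted_solve(task: str) -> str:
    # Stub: the trusted model is weaker but assumed safe.
    return f"trusted-solution:{task}"

def trusted_monitor(task: str, solution: str) -> float:
    # Stub suspicion score in [0, 1]; a real monitor would inspect the
    # solution. Here "!" in the solution marks it as suspicious.
    return 0.9 if "!" in solution else 0.1

def protocol(task: str, threshold: float = 0.5) -> str:
    solution = untrusted_solve(task)
    suspicion = trusted_monitor(task, solution)
    if suspicion > threshold:
        # Defer to the trusted model on high suspicion. If main tasks are
        # easy enough that trusted_solve usually succeeds, this branch can
        # simply be taken always, which makes the protocol look deceptively
        # strong relative to the hard-task regime.
        return trusted_solve(task)
    return solution
```

Under this framing, the concern in the thread is that on easy main tasks the `trusted_solve` branch (or even skipping the untrusted model entirely) is close to optimal, so measured safety/usefulness numbers may not transfer to tasks the trusted model cannot solve.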
Great stuff. Keep up the good work.