Redwood Research blog
Measuring whether AIs can statelessly strategize to subvert security measures
The complement to control evaluations
Dec 20, 2024 • Buck Shlegeris and Alex Mallen
Alignment Faking in Large Language Models
In our experiments, AIs will often strategically pretend to comply with the training objective to prevent the training process from modifying their…
Dec 18, 2024 • Ryan Greenblatt and Buck Shlegeris
November 2024
Why imperfect adversarial robustness doesn't doom AI control
There are crucial disanalogies between preventing jailbreaks and preventing misalignment-induced catastrophes.
Nov 18, 2024 • Buck Shlegeris
Win/continue/lose scenarios and execute/replace/audit protocols
In this post, I’ll make a technical point that comes up when thinking about risks from scheming AIs from a control perspective.
Nov 15, 2024 • Buck Shlegeris
October 2024
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
One strategy for mitigating risk from schemers (that is, egregiously misaligned models that intentionally try to subvert your safety measures) is…
Oct 10, 2024 • Buck Shlegeris
September 2024
A basic systems architecture for AI agents that do autonomous research
And diagrams showing how threat scenarios involving misaligned AI compromise the system in different places.
Sep 26, 2024 • Buck Shlegeris
How to prevent collusion when using untrusted models to monitor each other
Suppose you’ve trained a really clever AI model, and you’re planning to deploy it in an agent scaffold that allows it to run code or take other actions.
Sep 25, 2024 • Buck Shlegeris
August 2024
Would catching your AIs trying to escape convince AI developers to slow down or undeploy?
I'm not so sure.
Aug 26, 2024 • Buck Shlegeris
Fields that I reference when thinking about AI takeover prevention
Is AI takeover like a nuclear meltdown? A coup? A plane crash?
Aug 13, 2024 • Buck Shlegeris
June 2024
Getting 50% (SoTA) on ARC-AGI with GPT-4o
You can just draw more samples
Jun 17, 2024 • Ryan Greenblatt
Access to powerful AI might make computer security radically easier
AI might be really helpful for reducing security risk.
Jun 10, 2024 • Buck Shlegeris
AI catastrophes and rogue deployments
It’s interesting to classify possible AI catastrophes based on whether or not they involve a "rogue deployment".
Jun 3, 2024 • Buck Shlegeris