Reading List
(Last updated Jul 9 2025)
Section 1 is a quick guide to the key ideas in AI control, aimed at readers who want to get up to speed as quickly as possible.
Section 2 is an extensive guide to almost all of our writing on AI risk, aimed at those who want to gain a deep understanding of Redwood’s worldview.
1. Key Ideas in AI Control
Papers and blog posts
Foundational content
The case for ensuring that powerful AIs are controlled (Buck Shlegeris and Ryan Greenblatt, May 2024): This is our original essay explaining why we think AI control is a valuable research direction.
AI Catastrophes And Rogue Deployments (Buck Shlegeris, Jun 2024): We often refer to rogue deployments when talking about what AI control aims to prevent.
[external] Scheming AIs: Will AIs fake alignment during training in order to get power? (Joe Carlsmith, Nov 2023): Read this for the current state of the art on arguments for why your models might end up conspiring against you. This is long; you might want to just read the first section, which summarizes the whole report.
Examples of empirical research
AI Control: Improving Safety Despite Intentional Subversion (Ryan Greenblatt et al., Dec 2023): This was the original paper on control.
Ctrl-Z: Controlling AI Agents via Resampling (Aryan Bhatt et al., Apr 2025): This is the sequel to the original control paper. Here, we study high-stakes failures in an agent setting.
In-depth analyses of threat models and countermeasures, and implications for control methodology
Win/continue/lose scenarios and execute/replace/audit protocols (Buck Shlegeris, Nov 2024): The basic mathematical ideas of this post are crucial for thinking quantitatively about the main problems with using AI control to prevent security failures (a rough sketch of the arithmetic follows this list).
Thoughts on the conservative assumptions in AI control (Buck Shlegeris, Jan 2025): This describes some of the ways that most of our control work is conservative, and explains why we usually choose to do the research that way.
How to prevent collusion when using untrusted models to monitor each other (Buck Shlegeris, Sep 2024): This covers in detail several important points about untrusted monitoring, a promising control technique.
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking (Julian Stastny and Buck Shlegeris, May 2025): Sandbagging and exploration hacking are fundamental issues which often come up when reasoning about AI control measures.
How can we solve diffuse threats like research sabotage with AI control? (Vivek Hebbar, Apr 2025): This explains an important class of threat models where catastrophe requires numerous failures, but each failure is only weak evidence of malign intent when investigated.
Catching AIs Red-Handed (Buck Shlegeris and Ryan Greenblatt, May 2024): Many control plans rely on catching AIs red-handed, so it's crucial to think through what happens after an AI gets caught.
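To give a flavor of the arithmetic from the win/continue/lose post (our paraphrase, with illustrative symbols): suppose that each time the untrusted AI attempts an attack, the protocol audits the attempt and catches the AI with probability $p_{\text{win}}$, the attack executes and causes a catastrophe with probability $p_{\text{lose}}$, and the attempt is otherwise harmlessly replaced so the scenario continues. Then, over repeated attempts,

$$P(\text{safety}) = \frac{p_{\text{win}}}{p_{\text{win}} + p_{\text{lose}}},$$

so what matters quantitatively is the ratio of catching to catastrophe per attempt, not how often attempts are merely deflected.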
Podcasts
Ryan On The 80,000 Hours Podcast (Jul 2025): Ryan talks about AI R&D automation and ways misaligned AIs could take over. Most of Redwood's work aims at reducing risk in the kinds of scenarios he describes in this episode.
Buck On The 80,000 Hours Podcast (Apr 2025): Buck talks about some detailed threat models and explains how AI control research can yield cost-effective techniques for reducing risk.
AXRP Podcast with Buck and Ryan (Apr 2024): Buck and Ryan make the basic case for AI control and explain the original AI control paper.
2. Extensive Guide
Particularly recommended readings are bolded and listed first. As one might expect, some of our work is hard to categorize.
Threat models (zoomed out)
Reading some of these posts can provide helpful context on Redwood's worldview, including how AI progress might develop, how that progress yields misalignment risk, and why we think AI control is promising within this broader context.
The case for ensuring that powerful AIs are controlled (Buck Shlegeris and Ryan Greenblatt, May 2024)
Clarifying AI R&D threat models (Josh Clymer, Apr 2025)
Why do misalignment risks increase as AIs get more capable? (Ryan Greenblatt, Apr 2025)
Prioritizing Threats For AI Control (Ryan Greenblatt, Mar 2025)
Planning For Extreme AI Risks (Josh Clymer, Jan 2025)
Ten People On The Inside (Buck Shlegeris, Jan 2025)
Forecasting AI trends
A breakdown of AI capability levels focused on AI R&D labor acceleration (Ryan Greenblatt, Dec 2024)
Ryan On The 80,000 Hours Podcast (Jul 2025)
How much novel security-critical infrastructure do you need during the singularity? (Buck Shlegeris, Jul 2025)
What does 10x-ing effective compute get you? (Ryan Greenblatt, Jun 2025)
Slow corporations as an intuition pump for AI R&D automation (Ryan Greenblatt and Eli Lifland, May 2025)
What's going on with AI progress and trends? (As of 5/2025) (Ryan Greenblatt, May 2025)
A note by Ryan Greenblatt on the probability of software-only singularity (Jan 2025)
A basic systems architecture for AI agents that do autonomous research (Buck Shlegeris, Sep 2024)
Getting 50% (SoTA) on ARC-AGI with GPT-4o (Ryan Greenblatt, Jun 2024)
Access to powerful AI might make computer security radically easier (Buck Shlegeris, Jun 2024)
Threat models (zoomed in)
At Redwood, we spend a lot of time on detailed threat modeling focused on scheming AIs; this informs our prioritization when developing plans to reduce catastrophic misalignment risk.
How could scheming AIs make things difficult for us?
Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking (Julian Stastny and Buck Shlegeris, May 2025)
How can we solve diffuse threats like research sabotage with AI control? (Vivek Hebbar, Apr 2025)
AI Catastrophes And Rogue Deployments (Buck Shlegeris, Jun 2024)
Auditing failures vs concentrated failures (Ryan Greenblatt and Fabien Roger, Dec 2023)
What's worse, spies or schemers? (Buck Shlegeris and Julian Stastny, Jul 2025)
Comparing risk from internally-deployed AI to insider and outsider threats from humans (Buck Shlegeris, Jun 2025)
Buck On The 80,000 Hours Podcast (Apr 2025)
Measuring whether AIs can statelessly strategize to subvert security measures (Alex Mallen et al., Dec 2024)
Why imperfect adversarial robustness doesn't doom AI control (Buck Shlegeris, Nov 2024)
Untrusted smart models and trusted dumb models (Buck Shlegeris, May 2024)
How likely are scheming AIs?
Alignment Faking In Large Language Models (Ryan Greenblatt et al., Dec 2024)
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming (Buck Shlegeris, Oct 2024)
To be legible, evidence of misalignment probably has to be behavioral (Ryan Greenblatt, Apr 2025)
How will we update about scheming? (Ryan Greenblatt, Jan 2025)
How might schemers function?
Training-time schemers vs behavioral schemers (Alex Mallen, May 2025)
There are two fundamentally different constraints on schemers (Buck Shlegeris, Jul 2025)
Two proposed projects on abstract analogies for scheming (Julian Stastny, Jul 2025)
When does training a model change its goals? (Vivek Hebbar and Ryan Greenblatt, Jun 2025)
How training-gamers might function (and win) (Vivek Hebbar, Apr 2025)
How can we control misaligned AIs?
We perform a mix of conceptual and empirical research on ways to control scheming AIs. Most of our work focuses on developing control protocols, along with settings in which those protocols can be empirically evaluated.
The main plan: develop control techniques
We are interested both in techniques that catch scheming AIs ("incrimination"), which are particularly promising when AIs might cause egregious failures, and in techniques for eliciting high-quality work from schemers (e.g. using online training), which are particularly important when AIs can subtly sabotage us.
AI Control: Improving Safety Despite Intentional Subversion (Ryan Greenblatt et al., Dec 2023)
Ctrl-Z: Controlling AI Agents via Resampling (Aryan Bhatt et al., Apr 2025)
How to prevent collusion when using untrusted models to monitor each other (Buck Shlegeris, Sep 2024)
Win/continue/lose scenarios and execute/replace/audit protocols (Buck Shlegeris, Nov 2024)
Thoughts on the conservative assumptions in AI control (Buck Shlegeris, Jan 2025)
7+ Tractable Directions In AI Control (Ryan Greenblatt and Julian Stastny, Apr 2025)
An Overview Of Areas Of Control Work (Ryan Greenblatt, Apr 2025)
An Overview Of Control Measures (Ryan Greenblatt, Apr 2025)
Notes on countermeasures for exploration hacking (aka sandbagging) (Ryan Greenblatt, Apr 2025)
Notes on handling non-concentrated failures with AI control: high level methods and different regimes (Ryan Greenblatt, Mar 2025)
Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats (Jiaxin Wen et al., Nov 2024)
Fields that I reference when thinking about AI takeover prevention (Buck Shlegeris, Aug 2024)
Stress-Testing Capability Elicitation With Password-Locked Models (Ryan Greenblatt et al., May 2024)
Spreading our bets
Some of our work involves coming up with contingency plans as well as more speculative bets.
Making Deals With Early Schemers (Julian Stastny et al., Jun 2025)
AIs at the current capability level may be important for future safety work (Ryan Greenblatt, May 2025)
Jankily Controlling Superintelligence (Ryan Greenblatt, Jun 2025)
AI Safety Techniques Leveraging Distillation (Ryan Greenblatt, Jun 2025)
The case for countermeasures to memetic spread of misaligned values (Alex Mallen, May 2025)
How might we safely pass the buck to AI? (Josh Clymer, Feb 2025)
Managing catastrophic misuse without robust AI (Ryan Greenblatt and Buck Shlegeris, May 2024)
Preventing model exfiltration with upload limits (Ryan Greenblatt, May 2024)
Safety cases
In an ideal world, AI developers would make extensive safety cases to ensure that their deployments are safe. Control methods have the advantage of being particularly well suited to making quantitative safety estimates.
A sketch of an AI control safety case (Tomek Korbak et al., Jan 2025)
Takeaways from sketching a control safety case (Josh Clymer, Jan 2025)
AI companies are unlikely to make high-assurance safety cases if timelines are short (Ryan Greenblatt, Jan 2025)
When does capability elicitation bound risk? (Josh Clymer, Jan 2025)
Extending control evaluations to non-scheming threats (Josh Clymer, Jan 2025)
So we caught a scheming AI, what now?
We hope that catching a scheming AI will drastically change how the world treats risks from misaligned AI, but we have some contingency plans in case this doesn’t happen.
Catching AIs Red-Handed (Buck Shlegeris and Ryan Greenblatt, May 2024)
Handling schemers if shutdown is not an option (Buck Shlegeris, Apr 2025)
Prefix cache untrusted monitors: a method to apply after you catch your AI (Ryan Greenblatt, Jun 2025)
Would catching your AIs trying to escape convince AI developers to slow down or undeploy? (Buck Shlegeris, Aug 2024)