Redwood Research is a research nonprofit based in Berkeley. We investigate risks posed by the development of powerful artificial intelligence and techniques for mitigating those risks.
Audio narrations of this blog are available on our podcast. Search “Redwood Research” or subscribe via Apple Podcasts, Spotify, RSS, or other platforms.
Our main focus areas:
AI control. We introduced and have continued to advance the research area of AI control. In our ICML oral paper AI Control: Improving Safety Despite Intentional Subversion, we proposed protocols for monitoring potentially malign LLM agents. AI companies and other researchers have since built on this work (Benton et al., Mathew et al.), and AI control has been framed as a bedrock approach for mitigating catastrophic risk from misaligned AI (Grosse, Buhl, Balesni, Clymer).
Evaluations and demonstrations of risk from strategic deception. In Alignment Faking in Large Language Models, a collaboration with Anthropic, we demonstrated that Claude sometimes hides misaligned intentions. This work provides the strongest concrete evidence to date that LLMs might naturally fake alignment in order to resist attempts to train them.
Consulting on risks from misalignment. We collaborate with governments and advise AI companies including Google DeepMind and Anthropic on practices for assessing and mitigating risks from misaligned AI agents.
