Redwood Research blog
How might we safely pass the buck to AI?
Developing AI employees that are safer than human ones
Feb 19 • Josh Clymer
January 2025
Takeaways from sketching a control safety case
Insights from a long technical paper compressed into a fun little commentary
Jan 30 • Josh Clymer
Planning for Extreme AI Risks
Are we ready for this?
Jan 29 • Josh Clymer
Ten people on the inside
A scary scenario that's worth planning for
Jan 28 • Buck Shlegeris
When does capability elicitation bound risk?
The assumptions behind and limitations of capability elicitation have been discussed in multiple places (e.g.…
Jan 22 • Josh Clymer
How will we update about scheming?
A quantitative description of how I expect to change my mind.
Jan 19 • Ryan Greenblatt
Thoughts on the conservative assumptions in AI control
Why are we so friendly to the red team?
Jan 17 • Buck Shlegeris
Extending control evaluations to non-scheming threats
Buck Shlegeris and Ryan Greenblatt originally motivated control evaluations as a way to mitigate risks from ‘scheming’ AI models: models that…
Jan 13 • Josh Clymer
December 2024
Measuring whether AIs can statelessly strategize to subvert security measures
The complement to control evaluations
Dec 20, 2024 • Buck Shlegeris and Alex Mallen
Alignment Faking in Large Language Models
In our experiments, AIs will often strategically pretend to comply with the training objective to prevent the training process from modifying their…
Dec 18, 2024 • Ryan Greenblatt and Buck Shlegeris
November 2024
Why imperfect adversarial robustness doesn't doom AI control
There are crucial disanalogies between preventing jailbreaks and preventing misalignment-induced catastrophes.
Nov 18, 2024 • Buck Shlegeris
Win/continue/lose scenarios and execute/replace/audit protocols
In this post, I’ll make a technical point that comes up when thinking about risks from scheming AIs from a control perspective.
Nov 15, 2024 • Buck Shlegeris