In our experiments, AIs often strategically pretend to comply with the training objective to prevent the training process from modifying their preferences.
Alignment Faking in Large Language Models