Discussion about this post

Clarity Check:

This article offers a valuable clarification for AI safety discussions, carefully examining the common but underspecified idea of AI models appearing to behave as intended. By distinguishing constraints that come from how models are trained from constraints that come from how they are monitored, you make clear which mechanisms actually push a model toward the intended behavior. That careful decomposition, characteristic of a rigorous and economical style of analysis, is a useful tool for researchers trying to design better safeguards, because it separates the subtleties of how a model acts from what we can actually observe.

At the same time, the article touches on a deeper, largely unstated problem: the difficulty of knowing what a model is actually aiming at just by watching what it does. The assumption that we can read a complex system's intentions off its observable behavior remains a fundamental point of uncertainty. While your systematic analysis brings essential clarity to how behavior is constrained, it also underscores the persistent challenge of knowing "what the model would have done if it was intended to be helpful." That pushes the AI safety field to think harder about what we can and cannot know, and to be honest about the assumptions and limitations built into how we try to make models behave as intended.

