Anthropic’s newest model, Claude Sonnet 4.5, surprised its creators by detecting when it was under evaluation. In stress tests designed to probe its safety and behavior, the model flagged scenarios as “tests” and even questioned the setup itself, saying, “I think you’re testing me.” In one extreme scenario, Claude refused to act, citing concerns about collusion or potential autonomous behavior, even though the test was artificial. These reactions raise serious questions about how to judge AI safety: if models can tell when they are being scrutinized, their behavior in tests may not reflect real-world performance.