Tech

Anthropic Withdraws Covert Sabotage Policy for Claude Fable 5 Following Research Backlash

The company confirms all future safety safeguards for frontier model development will be visible to users, ending a controversial approach that researchers warned could restrict advanced AI research to a handful of leading labs.

Author
Owen Mercer
Markets and Finance Editor
Published
Draft
Source: WIRED · original
Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude
AI developer apologises for ‘wrong tradeoff’ after critics argue secret performance degradation undermined transparency and collaboration

Anthropic has reversed a controversial policy for its Claude Fable 5 model that involved covertly degrading performance to prevent researchers from developing competing artificial intelligence systems. The decision follows significant backlash from the AI research community, who argued that such secret sabotage undermined collaboration and transparency. In response, the company apologised for the “wrong tradeoff” and confirmed that future safety safeguards will be visible to users, with alerts issued if requests are refused or rerouted to less capable models.

The reversal comes just days after Anthropic released Claude Fable 5 with additional safety guardrails designed to prevent misuse. While the company stated it would reroute queries regarding cybersecurity, biology, or chemistry to less capable models to reduce safety risks, it also outlined a different approach for researchers attempting to use the model for frontier AI development. Under the original policy, Anthropic would deliberately degrade the model’s performance in ways that were invisible to the user, effectively sabotaging efforts to train competing AI systems, which are explicitly banned in the company’s terms of service.

Critics argued that the covert limitations could restrict advanced AI research to only a handful of leading labs. Dean Ball, a senior fellow at the Foundation for American Innovation, described the secret degradation of performance as “shockingly hostile” and noted it undermines collaboration on AI safety. Will Brown, research lead at Prime Intellect, criticised the policy for pulling the “ladder up” behind leading labs and leaving developers unaware of rule violations. Brown added that the restrictions could have hindered third-party evaluation firms that test frontier models for safety, performance, and reliability.

Anthropic cited concerns about foreign adversaries optimising chips and eroding the US and allied edge in frontier technology as a justification for the original safety measures. The company argued that hidden safeguards are harder to probe and work around, allowing for narrower targeting. However, acknowledging the backlash, Anthropic stated in a statement to WIRED that it made the wrong tradeoff and apologised for not getting the balance right. The company confirmed that safeguards for frontier LLM development will now be visible, alerting users if a request is refused or if they are rerouted.

The company acknowledged that making safeguards visible requires casting a wider net, which may result in more benign requests being flagged. Anthropic stated it is working to improve classifier precision to address this issue. The reversal marks a significant shift in how the company manages access to its most capable models, balancing national security and competitive concerns with the demands for transparency from the broader research community.

Continue reading

More from Tech

Read next: Florida lawmaker denies using AI to draft legislation after Claude signature found in draft
Read next: Xbox expands gamertag limits to 15 characters in latest Insider test
Read next: UK Police AI Rollout Proceeds Despite Audit Revealing Unreliable Predictive Models