Anthropic apologises for invisible Claude Fable 5 guardrails after researcher backlash

Following intense criticism from the AI research community, Anthropic has reversed its policy on hidden safety measures in its new Mythos-class model, acknowledging the approach was the wrong tradeoff.

Author

Owen Mercer

Markets and Finance Editor

Published

Draft

Source: The Verge · original

AI firm admits covert distillation safeguards hindered evaluation and will now route affected queries to Opus 4.8 with explicit user notification

Anthropic has apologised for deploying invisible guardrails in its new AI model, Claude Fable 5, that silently degraded responses to suspected distillation attempts. The company acknowledged this approach was the "wrong tradeoff" following intense backlash from the AI research community, which argued that covert safeguards hindered model evaluation and development. Under the revised policy, queries identified as distillation attempts will be routed to the previous flagship model, Claude Opus 4.8, with users explicitly notified of the switch. This transparency measure aims to restore trust, although Anthropic noted that using Claude to develop competing models already violates its Terms of Service.

Claude Fable 5 is the first widely available model in Anthropic’s "Mythos" class of AI systems, a group the company previously warned was too dangerous for public release. To manage risks, Anthropic implemented safeguards that prevent the model from responding to certain high-risk queries. One such area is distillation, a technique used to train smaller AI models using the outputs of larger ones. Initially, the company’s system card stated it would alter and degrade answers for suspected distillation attempts without notifying users that they had triggered a safety measure.

The decision to use invisible safeguards was driven by a desire to ship quickly with fewer false positives. In a statement provided to The Verge, Anthropic explained that visible safeguards can be probed and must be robust, which takes time to implement correctly. However, the company conceded that this lack of transparency was a mistake. "We made the wrong tradeoff and we apologize for not getting the balance right," the firm said, adding that users should have visibility into the safeguards in place and the reasons for them.

Under the new approach, queries identified as distillation attempts will no longer be silently degraded. Instead, they will be routed to Claude Opus 4.8, and users will receive a prominent notification every time this occurs. This aligns with how Fable handles other high-risk areas such as biology, chemistry, and cybersecurity, where queries are either routed through Opus 4.8 or blocked outright. Anthropic acknowledged that some safeguards, particularly in biology, had been calibrated so broadly that Fable was "practically unusable" for basic queries.

The policy shift comes after researchers reported that the undisclosed restrictions sabotaged their work and wasted resources. Anthropic maintains that using Claude to develop competing models violates its Terms of Service, and it has previously accused rivals such as DeepSeek of distilling its models at industrial scale. Even so, the company recognises that the previous method undermined legitimate evaluation efforts. The change marks a significant adjustment in how Anthropic balances safety with transparency in its frontier AI development.
