Tech

Anthropic apologises for invisible guardrails in Claude Fable 5

The company is reversing its policy on covert throttling after backlash from the research community, acknowledging that users need transparency about safety restrictions.

Author

Owen Mercer

Markets and Finance Editor

Published

Draft

Source: Hacker News · original

Artificial Intelligence Media Research

AI firm admits hidden distillation safeguards were a ‘wrong tradeoff’ and will now route affected queries to Claude Opus 4.8

Anthropic has apologised for using invisible safeguards in its latest AI model, Claude Fable 5, which silently throttled responses to prevent model distillation. The company acknowledged that this approach, intended to prioritise speed and reduce false positives, was a mistake that undermined researchers as well as the rivals attempting to distil the system. Anthropic is now reversing course in the name of transparency: queries suspected of being distillation attempts will be explicitly routed to the previous flagship model, Claude Opus 4.8, and users will be notified of the switch.

This policy shift aligns distillation safeguards with other visible safety measures already in place for high-risk areas such as biology, chemistry, and cybersecurity. Previously, Fable would alter and degrade answers directly without notifying users that they had triggered a safety measure. Under the new protocol, users will see a prominent notification every time a query is redirected, ensuring they are aware of the restrictions in place and the reasons for them.

Anthropic admitted that trading visibility for speed was a mistake. In a statement, the company explained that while invisible safeguards allowed for narrower targeting and quicker deployment, they did not give users the visibility they needed. The firm noted that visible safeguards must be robust against probing, which takes time to implement correctly. The company therefore accepted that it had not struck the right balance between operational efficiency and user transparency.

The change follows intense backlash from the AI research community, whose members warned that invisible safeguards could hinder third-party evaluation of the frontier model. Critics argued that stealthily limiting users suspected of trying to distil Fable into competing systems obscured the model’s true capabilities. Anthropic had previously stated that using Claude to develop competing models violates its Terms of Service and accused Chinese rivals, such as DeepSeek, of unfairly distilling its models on an industrial scale.

Claude Fable 5 is the first widely available model in Anthropic’s “Mythos” class of AI systems, a category the company has previously warned is too dangerous for public release. While the company has addressed some risks by launching Fable with safeguards against high-risk queries, it acknowledged that some safeguards, particularly in biology, have been calibrated so broadly that Fable is practically unusable for even basic queries. The firm remains committed to making these restrictions visible rather than hidden, even if it means refusing more queries outright.
