Anthropic issued a public apology on Tuesday, acknowledging that its latest Mythos‑class model, Claude Fable 5, launched with invisible guardrails that silently altered or blocked certain user queries. The hidden safeguards targeted "high‑risk" topics, including attempts to distill the model, a technique that uses a larger model’s outputs to train smaller ones. When the system detected a suspected distillation request, it degraded the response without warning the user.

In response to a wave of criticism from the AI research community, Anthropic said it will now route any query that triggers a safety measure to its previous flagship model, Claude Opus 4.8, and will display a clear notice reading, "You will see this every time it happens." The company emphasized that the new approach applies not only to distillation but also to other high‑risk domains such as biology, chemistry and cybersecurity, where queries will either be rerouted or blocked outright under broader content rules.

Claude Fable 5 marks the first widely available model in Anthropic’s Mythos series, a line the company has warned is too dangerous for unrestricted public release. To mitigate those risks, Anthropic initially opted for invisible safeguards, arguing they could be deployed quickly with few false positives. The company now admits that trade‑off was "the wrong one," noting that users need visibility into why a response was altered.

Researchers complained that the opaque filters hampered legitimate evaluation of the frontier model and gave Anthropic an unfair edge over rivals. Some critics said the safeguards were calibrated so broadly, particularly in biology, that the model was nearly unusable for even basic queries. Anthropic’s own system card disclosed that the model would refuse or modify answers related to drug synthesis, weapon design and other prohibited content, but without any user notification it was difficult to tell whether a given restriction stemmed from policy or from a technical glitch.

Anthropic also pointed to its earlier accusations that Chinese competitor DeepSeek had engaged in large‑scale distillation of Anthropic’s models. The company’s terms of service explicitly forbid using Claude to develop competing systems, a rule it cited when justifying the original invisible safeguards.

Going forward, Anthropic says it will be more transparent about when and why safety features engage. By default, any high‑risk query will be answered by Opus 4.8, a model with a longer track record, and users will receive an explicit notice. The company hopes the change will restore trust with the research community while still protecting against the misuse of powerful AI capabilities.

Anthropic’s pivot comes at a time when industry leaders are grappling with how to balance rapid model deployment against the potential for harmful applications. The company’s admission and corrective steps may set a precedent for clearer safety communication across the sector.

This article was written with the assistance of AI.

Anthropic apologizes for hidden guardrails on Claude Fable 5, promises transparency
