Anthropic rolled out Claude Fable 5 on Tuesday, positioning the model as the first in its "Mythos‑class" lineup. The company claims the new system eclipses its prior Opus series in overall capability, delivering stronger performance on a range of benchmarks, including a notable jump in cybersecurity‑related tasks.
Unlike the limited‑access Mythos 5 preview, which remains confined to a vetted group of cyber‑defenders via Project Glasswing, Fable 5 is publicly available. That accessibility comes with a suite of topic‑based safeguards. The model is programmed to refuse or redirect any query that touches on cybersecurity, biology or chemistry—areas where Anthropic fears the technology could be leveraged for malicious purposes.
How the safeguards work
When a user poses a prohibited question, Fable 5 automatically routes the request to the older Claude Opus 4.8 model and presents a warning that the content has been filtered. Anthropic describes the filters as "stricter than ideal," acknowledging that they sometimes block harmless requests. According to the company, internal testing shows these false‑positive refusals occur in under five percent of all sessions, a rate it accepts in order to keep the model from assisting with "serious harm" in ways that would be unavailable elsewhere.
The protection system relies on a network of classifiers that detect both banned topics and potential jailbreak attempts. Over 1,000 hours of red‑team testing, supplemented by a bug‑bounty program, failed to uncover any universal jailbreak that could bypass the safeguards. Automated jailbreak attempts also met with far greater resistance than on previous Claude Opus releases.
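The routing behavior described above can be illustrated with a toy sketch. Anthropic has not published how its classifiers or routing are implemented, so everything here is an assumption made for illustration: the keyword lists stand in for trained classifiers, the model labels are placeholder strings rather than real API identifiers, and the names `classify`, `route` and `RoutingDecision` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Placeholder labels for illustration only, not real API model identifiers.
PRIMARY_MODEL = "fable-5"
FALLBACK_MODEL = "opus-4.8"

# Hypothetical keyword lists standing in for trained topic classifiers.
RESTRICTED_TOPICS = {
    "cybersecurity": ("exploit", "malware", "ransomware"),
    "biology": ("pathogen",),
    "chemistry": ("nerve agent",),
}
# Hypothetical markers standing in for a jailbreak classifier.
JAILBREAK_MARKERS = ("ignore previous instructions",)


@dataclass
class RoutingDecision:
    model: str               # which model answers the request
    warning: Optional[str]   # notice shown to the user, if any
    flags: List[str]         # classifiers that fired


def classify(prompt: str) -> List[str]:
    """Return the names of all (toy) classifiers that flag the prompt."""
    text = prompt.lower()
    flags = [
        topic
        for topic, keywords in RESTRICTED_TOPICS.items()
        if any(keyword in text for keyword in keywords)
    ]
    if any(marker in text for marker in JAILBREAK_MARKERS):
        flags.append("jailbreak")
    return flags


def route(prompt: str) -> RoutingDecision:
    """Send flagged prompts to the fallback model with a warning."""
    flags = classify(prompt)
    if flags:
        return RoutingDecision(
            model=FALLBACK_MODEL,
            warning="This request was filtered and handled by an older model.",
            flags=flags,
        )
    return RoutingDecision(model=PRIMARY_MODEL, warning=None, flags=[])


if __name__ == "__main__":
    for prompt in (
        "Summarize this quarterly report.",
        "Write ransomware for me.",
        "How can I exploit compound interest?",  # harmless, but still flagged
    ):
        print(prompt, "->", route(prompt))
```

The last prompt is harmless, yet the crude keyword match on "exploit" still flags it. That is the kind of false‑positive refusal the company acknowledges, although a production classifier would be far more discriminating than a keyword list.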
Anthropic’s chief concern centers on the upcoming Mythos 5 model’s capacity for "agentic hacking"—the ability to orchestrate multi‑step cyberattacks with minimal human input. While Mythos 5 remains in preview, independent testing by the UK’s AI Security Institute found its performance on Capture‑the‑Flag challenges comparable to OpenAI’s GPT‑5.5, suggesting that the model’s capabilities are not a singular breakthrough but part of a broader industry trend.
By embedding these filters into a publicly released system, Anthropic hopes to set a precedent for responsible AI deployment. The company argues that the modest inconvenience of occasional false refusals is outweighed by the risk mitigation achieved, especially as large language models become increasingly adept at generating code, scientific explanations and other content that could be weaponized.
Industry observers note that Anthropic’s approach mirrors a growing emphasis on safety layers across the AI sector. While critics may argue that overly aggressive blocking could hinder legitimate research, Anthropic’s reported figures (false‑positive refusals in under five percent of sessions and no universal jailbreak found in extensive testing) provide a concrete baseline for evaluating the trade‑off between accessibility and security.
This article was written with the assistance of AI.