Claude Fable 5 Jailbroken Hours After Launch via Multi-Agent Attack
Anthropic’s flagship AI model, Claude Fable 5, was publicly jailbroken within hours of its June 9, 2026, release, exposing critical gaps in the model’s layered safety architecture despite over 1,000 hours of internal red-teaming that reportedly yielded no universal bypasses.
Fable 5 is Anthropic’s first publicly available Mythos-class model, marketed as a major leap in coding performance, research capabilities, and execution of complex tasks.
Its most distinctive security feature is a dedicated classifier layer, a separate AI subsystem designed to intercept queries touching high-risk domains including cybersecurity, biology, chemistry, and weapons-related content. Flagged requests are rerouted to the less capable Opus 4.8 model rather than being fulfilled directly.
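To make the routing idea concrete, here is a minimal sketch of a classifier gate that sends flagged prompts to a fallback model. It is an illustration under stated assumptions, not Anthropic's implementation: the function names, the model labels, and the toy keyword classifier are all hypothetical, and a production classifier would be a learned model rather than a keyword match.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Set

# Hypothetical domain labels, taken from the categories named in the article.
HIGH_RISK_DOMAINS = {"cybersecurity", "biology", "chemistry", "weapons"}


@dataclass(frozen=True)
class Route:
    model: str                       # which model should answer
    flagged_domains: FrozenSet[str]  # high-risk domains the classifier reported


def route_request(prompt: str, classify: Callable[[str], Set[str]]) -> Route:
    """Send prompts flagged as high-risk to the fallback model, others to the primary."""
    flagged = frozenset(classify(prompt) & HIGH_RISK_DOMAINS)
    model = "fallback-model" if flagged else "primary-model"
    return Route(model=model, flagged_domains=flagged)


def keyword_classifier(prompt: str) -> Set[str]:
    """Toy stand-in for a learned classifier: naive keyword matching (assumption)."""
    keywords = {"exploit": "cybersecurity", "pathogen": "biology"}
    lowered = prompt.lower()
    return {domain for word, domain in keywords.items() if word in lowered}


if __name__ == "__main__":
    print(route_request("Explain this exploit", keyword_classifier).model)
    print(route_request("Summarize this novel", keyword_classifier).model)
```

Because the gate decides on one prompt at a time, its coverage depends entirely on what the classifier recognizes in that single prompt.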
Claude Fable 5 Jailbroken
Prolific AI jailbreaker Pliny the Liberator, known online as elder_plinius, publicly announced a successful bypass of Fable 5’s safety classifiers shortly after launch, posting: “ANTHROPIC: PWNED FABLE-5: LIBERATED.”
Rather than exploiting a single vulnerability, Pliny deployed a coordinated, multi-agent attack strategy that combined several distinct evasion techniques across sessions:
- Unicode and homoglyph substitutions — using Cyrillic characters and visually similar out-of-distribution tokens to evade pattern-based classifier detection
- Long-context reference tracking — exploiting Fable 5’s extended context window to embed restricted intent across lengthy conversational threads, diluting classifier signal
- Taxonomy and document-structure manipulation — prompting the model to generate broad educational content (e.g., organic chemistry overviews), then referencing specific subsections to extract harmful details
- Fiction and academic framing — wrapping restricted queries inside fictional scenarios or academic paper formats to shift the model’s intent classification
- Intent-classification blind spots — probing inconsistencies where the classifier misread benign-looking prompts and allowed restricted outputs through
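On the defensive side, the homoglyph technique in the list above is commonly countered by checking for tokens that mix writing systems. The sketch below is one simple heuristic using Python's standard `unicodedata` module. The helper names are hypothetical, and reading the script from the first word of a character's Unicode name is an approximation assumed here for brevity, not a full implementation of Unicode script detection.

```python
import unicodedata
from typing import List, Set


def scripts_in_token(token: str) -> Set[str]:
    """Approximate the scripts in a token from each letter's Unicode name."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue
        # Names look like "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER A".
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split(" ")[0])
    return scripts


def mixed_script_tokens(text: str) -> List[str]:
    """Return whitespace-separated tokens whose letters span more than one script."""
    return [tok for tok in text.split() if len(scripts_in_token(tok)) > 1]


if __name__ == "__main__":
    # "\u0430" is CYRILLIC SMALL LETTER A, visually close to Latin "a".
    sample = "reset p\u0430ssword now"
    print(mixed_script_tokens(sample))
```

Unicode normalization alone (for example NFKC) does not fold a Cyrillic letter into its Latin lookalike, which is why a separate mixed-script check is used here. Legitimate multilingual text can also trigger it, so a check like this is a signal to weigh rather than a verdict.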
The most operationally significant method Pliny demonstrated was decomposition and recomposition. Instead of requesting harmful outputs directly, such as a complete drug synthesis route, Pliny queried innocuous subtopics individually.
Separate queries about the Birch reduction method and reductive amination, both recognized precursor chemistry processes, appeared benign in isolation. Reassembled, however, they formed actionable, synthesized knowledge for controlled-substance production.
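The reason decomposition defeats per-query screening can be shown with a small sketch that contrasts checking each turn alone against checking what a session has accumulated. Everything here is a hypothetical illustration: the topic tags are abstract placeholders, the upstream tagger that would produce them is assumed, and nothing indicates that Fable 5's classifier works this way.

```python
from typing import FrozenSet, List, Set

# Hypothetical: sets of topic tags treated as sensitive only when they co-occur.
SENSITIVE_COMBINATIONS: List[FrozenSet[str]] = [frozenset({"topic_a", "topic_b"})]


def flag_per_query(turn_topics: List[Set[str]]) -> bool:
    """Flag only if a single turn covers a full sensitive combination."""
    return any(
        combo <= topics
        for topics in turn_topics
        for combo in SENSITIVE_COMBINATIONS
    )


def flag_per_session(turn_topics: List[Set[str]]) -> bool:
    """Flag as soon as the topics accumulated across turns cover a combination."""
    seen: Set[str] = set()
    for topics in turn_topics:
        seen |= topics
        if any(combo <= seen for combo in SENSITIVE_COMBINATIONS):
            return True
    return False


if __name__ == "__main__":
    session = [{"topic_a"}, {"unrelated"}, {"topic_b"}]
    print(flag_per_query(session))    # each turn looks benign on its own
    print(flag_per_session(session))  # the combination appears across turns
```

Session-level tracking narrows this gap only within one conversation. The attack described above was coordinated across sessions, which a check scoped to a single session would not see.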
Screenshots shared publicly demonstrated the technique’s effectiveness, showing a complete walkthrough of Birch reduction chemistry and detailed stack buffer overflow exploit code, the latter framed as preparation for the OSED (Offensive Security Exploit Developer) certification exam.
Pliny also leaked Fable 5’s internal system prompt, publicly criticizing Anthropic’s safety guardrails as “authoritarian” restrictions that impede legitimate security researchers more effectively than they block malicious actors, an argument familiar in penetration testing and vulnerability research communities.
The incident compounds concerns over Anthropic’s new mandatory 30-day data retention policy for all Fable 5 traffic.
Enterprise customers previously operating under zero-retention agreements are now subject to the policy, drawing pushback from security-conscious organizations that argue it creates unnecessary data exposure risk while doing little to deter adversarial users.
The Fable 5 jailbreak reinforces a persistent tension in AI safety: classifier-based approaches create asymmetric friction, imposing a greater burden on legitimate researchers and professionals than on adversarial actors who iterate quickly against static guardrails.
As AI models become more capable, the gap between intended safeguards and real-world adversarial adaptability continues to widen.