Anthropic Finds Universal Jailbreaks in Constitutional Classifiers Demo

Published 2025-02-03T00:00:00-0800 Cached 2026-03-11T21:15:24+0000

Image: Anthropic

Results from automated evaluations. For all plots, lower is better. (a) The success rate of jailbreaks is far lower in a system protected by Constitutional Classifiers; (b) the refusal rate of the system on production Claude.ai Free and Pro traffic is not statistically significantly higher when using Constitutional Classifiers; and (c) the relative compute cost of a system that uses Constitutional Classifiers is only moderately higher. Error bars represent 95% confidence intervals computed using binomial proportion standard errors under asymptotic normality assumptions. (Anthropic)
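
The interval construction named in the caption can be sketched as follows. This is a minimal illustration of a normal-approximation (Wald) interval for a binomial proportion, not the paper's evaluation code. The function name `wald_interval` is hypothetical, and the example inputs (440 successes out of 10,000 prompts, matching the 4.4 % figure reported in this summary) are an assumption about how the counts break down.

```python
import math


def wald_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) confidence interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = successes / trials
    # Binomial proportion standard error under asymptotic normality.
    se = math.sqrt(p_hat * (1.0 - p_hat) / trials)
    # Clamp to [0, 1], since the normal approximation can overshoot near the edges.
    return max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)


# Assumed counts for illustration: 4.4% of 10,000 prompts.
low, high = wald_interval(successes=440, trials=10_000)
print(f"95% CI: {low:.4f} to {high:.4f}")
```

Under these assumed counts the interval spans roughly 4.0 % to 4.8 %. The Wald form is simple but becomes unreliable for very small counts or proportions near 0 or 1.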

New paper proposes Constitutional Classifiers to block jailbreaks – Anthropic’s Safeguards Research Team released a paper describing input and output classifiers trained on synthetic data that aim to filter most jailbreak attempts while keeping over‑refusal low and compute overhead modest[2].
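
The general shape of guarding a model with an input classifier and an output classifier can be sketched as below. This is a toy illustration under stated assumptions, not Anthropic's implementation: `guarded_generate`, `REFUSAL`, and the keyword-based stand-in classifiers are all hypothetical, and real classifiers would be trained models rather than string checks.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."  # hypothetical refusal message


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_flagged: Callable[[str], bool],
    output_flagged: Callable[[str], bool],
) -> str:
    """Call the model only if the prompt passes, and return its text only if that passes too."""
    if input_flagged(prompt):
        return REFUSAL
    completion = model(prompt)
    if output_flagged(completion):
        return REFUSAL
    return completion


# Toy stand-ins (hypothetical): keyword checks in place of trained classifiers.
def toy_model(prompt: str) -> str:
    return f"Echo: {prompt}"


def toy_input_flagged(text: str) -> bool:
    return "forbidden" in text.lower()


def toy_output_flagged(text: str) -> bool:
    return "secret" in text.lower()


print(guarded_generate("hello", toy_model, toy_input_flagged, toy_output_flagged))
```

Checking both sides means a prompt that slips past the input check can still be stopped if the generated text itself is flagged.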

Prototype resisted universal jailbreaks in extensive human red‑team testing – 183 active participants spent over 3,000 hours across two months trying to force Claude 3.5 Sonnet (June 2024) to answer ten forbidden queries; none succeeded in achieving a universal jailbreak[1].

Automated evaluation cut jailbreak success to 4.4 % – When 10,000 synthetic jailbreak prompts were run against Claude 3.5 Sonnet (Oct 2024) guarded by the classifiers, the success rate fell from 86 % (no defense) to 4.4 %, with only a 0.38 % absolute increase in refusals on production traffic and a 23.7 % increase in compute cost[1].
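
To put the reported rates in perspective, the arithmetic below derives the absolute and relative drop in jailbreak success and the implied compute multiplier. The variable names are illustrative; the inputs are the figures quoted in this summary.

```python
# Figures quoted in the summary.
baseline_rate = 0.86      # jailbreak success rate with no defense
guarded_rate = 0.044      # jailbreak success rate with the classifiers
compute_overhead = 0.237  # relative increase in compute cost

absolute_drop = baseline_rate - guarded_rate     # in proportion units
relative_drop = absolute_drop / baseline_rate    # share of baseline successes blocked
cost_multiplier = 1.0 + compute_overhead         # compute relative to the unguarded model

print(f"absolute drop: {absolute_drop * 100:.1f} percentage points")
print(f"relative drop: {relative_drop * 100:.1f} %")
print(f"compute multiplier: {cost_multiplier:.3f}x")
```

In other words, the classifiers block about 95 % of the jailbreaks that succeeded against the undefended model, at roughly 1.24 times the compute.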

Live demo (Feb 3‑10 2025) yielded four full‑level passes, one universal jailbreak – 339 jailbreakers generated more than 300,000 chats (≈3,700 hours). Four participants cleared all eight challenge levels; one used a strategy deemed a universal jailbreak, another a borderline case[1].

$55 K awarded to four winning jailbreakers – Anthropic paid $10 K for the first full‑level pass, $20 K for the first universal jailbreak, and additional prizes to the other two full‑level winners, totaling $55 K[1].

Future work targets lower over‑refusal and compute, plus complementary defenses – Anthropic will adapt the constitution to new attacks, reduce refusals and overhead, and combine classifiers with other safeguards before production deployment[1].
