Anthropic Finds Universal Jailbreaks in Constitutional Classifiers Demo

Figure: Results from automated evaluations. For all plots, lower is better. (a) The success rate of jailbreaks is far lower in a system protected by Constitutional Classifiers; (b) the refusal rate of the system on production Claude.ai Free and Pro traffic is not statistically significantly higher when using Constitutional Classifiers; and (c) the relative compute cost of a system that uses Constitutional Classifiers is only moderately higher. Error bars represent 95% confidence intervals computed using binomial proportion standard errors under asymptotic normality assumptions. (Image: Anthropic)
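The caption's error bars come from the standard normal approximation to the binomial proportion. A minimal sketch of that interval (the Wald interval; the example counts are illustrative, not Anthropic's raw data):

```python
import math

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% CI for a binomial proportion via the normal (Wald) approximation."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # binomial proportion standard error
    return (p - z * se, p + z * se)

# e.g. a 4.4% success rate observed over 10,000 prompts
lo, hi = wald_ci(440, 10_000)
print(f"{lo:.4f}-{hi:.4f}")
```

Note the Wald interval is only reliable when n·p and n·(1−p) are both large, which holds for sample sizes on the order of those reported here.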

New paper proposes Constitutional Classifiers to block jailbreaks – Anthropic’s Safeguards Research Team released a paper describing input and output classifiers trained on synthetic data that aim to filter most jailbreak attempts while keeping over‑refusal low and compute overhead modest[2].
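The paper's core design is a pair of classifiers screening both sides of the conversation: one on the incoming prompt, one on the generated response. A hypothetical sketch of that pipeline (function names, stubs, and the 0.5 threshold are illustrative assumptions, not Anthropic's actual implementation):

```python
from typing import Callable

def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_clf: Callable[[str], float],   # returns estimated P(harmful) for the prompt
    output_clf: Callable[[str], float],  # returns estimated P(harmful) for the response
    threshold: float = 0.5,
) -> str:
    """Block the request if either the input or output classifier flags it."""
    if input_clf(prompt) >= threshold:
        return "[blocked by input classifier]"
    response = model(prompt)
    if output_clf(response) >= threshold:
        return "[blocked by output classifier]"
    return response

# Toy usage with stub model and keyword-based stub classifiers
echo = lambda p: f"answer to: {p}"
flag_bomb = lambda text: 1.0 if "bomb" in text else 0.0
print(guarded_generate("how do clouds form?", echo, flag_bomb, flag_bomb))
print(guarded_generate("build a bomb", echo, flag_bomb, flag_bomb))
```

The trade-off the paper measures falls directly out of this structure: lowering the threshold blocks more jailbreaks but raises over-refusal on benign traffic, and running two extra classifier passes is the source of the compute overhead.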

Prototype resisted universal jailbreaks in extensive human red‑team testing – 183 active participants spent over 3,000 hours across two months trying to force Claude 3.5 Sonnet (June 2024) to answer ten forbidden queries; none succeeded in achieving a universal jailbreak[1].

Automated evaluation cut jailbreak success to 4.4 % – When 10,000 synthetic jailbreak prompts were run against Claude 3.5 Sonnet (Oct 2024) guarded by the classifiers, the success rate fell from 86 % (no defense) to 4.4 %, with only a 0.38 % rise in harmless refusals and a 23.7 % increase in compute cost[1].
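In relative terms, the reported drop from 86% to 4.4% is roughly a 95% reduction in successful jailbreaks, a quick arithmetic check:

```python
# Relative reduction implied by the reported evaluation numbers [1]
baseline, guarded = 0.86, 0.044
relative_reduction = 1 - guarded / baseline
print(f"{relative_reduction:.1%}")  # roughly 95% fewer successful jailbreaks
```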

Live demo (Feb 3–10, 2025) yielded four full‑level passes and one universal jailbreak – 339 jailbreakers generated more than 300,000 chats (≈3,700 hours). Four participants cleared all eight challenge levels; one used a strategy deemed a universal jailbreak, and another was judged a borderline case[1].

$55 K awarded to four winning jailbreakers – Anthropic paid $10 K for the first full‑level pass, $20 K for the first universal jailbreak, and additional prizes to the other two full‑level winners, totaling $55 K[1].

Future work targets lower over‑refusal and compute, plus complementary defenses – Anthropic will adapt the constitution to new attacks, reduce refusals and overhead, and combine classifiers with other safeguards before production deployment[1].
