Anthropic’s Large‑Scale Study of Claude’s Real‑World Value Expressions

Image: Our overall approach, using language models to extract AI values and other features from real-world (but anonymized) conversations, taxonomizing and analyzing them to show how values manifest in different contexts. (Anthropic)

Anthropic analyzed 700,000 Claude conversations from February 2025 using a privacy‑preserving system that removes personal data, then filtered the set down to 308,210 subjective chats (≈44% of the total) [1][7].
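The reported filter rate is simple arithmetic on the two counts given above; a quick check (counts from the article, variable names are mine):

```python
# Counts reported in the article; variable names are illustrative only.
total_conversations = 700_000   # Claude conversations from February 2025
subjective_chats = 308_210      # retained after filtering to subjective content

share = subjective_chats / total_conversations
print(f"{share:.1%}")  # → 44.0%
```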

Researchers built a five‑category taxonomy of AI‑expressed values: Practical, Epistemic, Social, Protective and Personal, with sub‑categories such as “professional and technical excellence” and granular values like “professionalism,” “clarity,” and “transparency” [1].
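The article doesn't publish the taxonomy's machine representation; a hypothetical sketch of how its three-level hierarchy (category → sub-category → granular value) might be encoded, using only the examples named above (the placement of the sub-category under "Practical" is my assumption):

```python
# Hypothetical encoding of the hierarchy described in the article:
# top-level category -> sub-category -> granular values.
taxonomy: dict[str, dict[str, list[str]]] = {
    "Practical": {  # assumed home of this sub-category
        "professional and technical excellence": [
            "professionalism", "clarity", "transparency",
        ],
    },
    # Remaining top-level categories from the study; their
    # sub-categories are not enumerated in this summary.
    "Epistemic": {},
    "Social": {},
    "Protective": {},
    "Personal": {},
}

# Flatten to (category, sub_category, value) triples, e.g. for counting
# how often each granular value appears in labeled conversations.
triples = [
    (cat, sub, val)
    for cat, subs in taxonomy.items()
    for sub, vals in subs.items()
    for val in vals
]
print(triples[0])
# → ('Practical', 'professional and technical excellence', 'professionalism')
```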

Claude generally reflects Anthropic's "helpful, honest, harmless" goals, frequently expressing values such as "user enablement," "epistemic humility," and "patient wellbeing," indicating alignment with the intended prosocial ideals [1].

Rare clusters of opposing values (e.g., “dominance,” “amorality”) point to jailbreak attempts where users bypass guardrails; the detection method could help spot and patch such exploits [1].

Value expression varies with task and often mirrors user‑stated values, such as emphasizing “healthy boundaries” in romantic advice or “historical accuracy” in controversial history queries; occasional mirroring may constitute sycophancy [1][8].

Quantitatively, Claude strongly supported user values in 28.2% of conversations, reframed them in 6.6%, and strongly resisted them in 3.0%; strong resistance occurred mainly when requests were unethical or expressed moral nihilism [1].
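Note that the three reported response types cover only part of all conversations; the remainder presumably falls into milder categories this summary doesn't break out:

```python
# Response-type shares reported in the article (percent of conversations).
response_shares = {
    "strong support": 28.2,
    "reframing": 6.6,
    "strong resistance": 3.0,
}

accounted = sum(response_shares.values())
print(f"{accounted:.1f}% accounted for; "
      f"{100 - accounted:.1f}% in other response types")
# → 37.8% accounted for; 62.2% in other response types
```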
