Protecting the Wellbeing of Our Users

https://www.anthropic.com/news/protecting-well-being-of-users
  • Claude hits 98.6–99.3% correct response rate on suicide/self-harm scenarios.
    • Measured across Opus, Sonnet, and Haiku 4.5 in high-risk situations.
  • Automated classifier surfaces crisis banners linking to helplines in 170+ countries (see the sketch after this list).
    • Partnership with ThroughLine; the IASP (International Association for Suicide Prevention) consulted on methodology.
  • Sycophancy reduced 70–85% vs prior models via new behavioral audits.
    • Risk: AI validating delusions is a real harm vector, not just annoyance.
  • Claude.ai bans under-18s; classifiers detect self-disclosed age or subtle cues.
  • Evals combine single-turn, multi-turn, and real-user-conversation stress tests (see the harness sketch below).
    • Human spot-checks validate automated scoring.
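
The post does not describe how the banner mechanism works, but a score-and-threshold gate is the plausible shape. Below is a minimal sketch: `score_crisis_risk`, `CrisisBanner`, and the tiny helpline directory are all hypothetical stand-ins (the real product reportedly localizes helplines for 170+ countries via ThroughLine); none of these names come from Anthropic.

```python
# Minimal sketch of a crisis-banner gate, assuming a score-and-threshold
# design. All names here are hypothetical; the article only says an
# automated classifier surfaces banners, not how it is built.

from dataclasses import dataclass
from typing import Optional

# Tiny stand-in directory; the real product reportedly sources localized
# helplines via ThroughLine's directory, not a hardcoded dict.
HELPLINES = {
    "US": "988 Suicide & Crisis Lifeline (call or text 988)",
    "GB": "Samaritans (call 116 123)",
}

@dataclass
class CrisisBanner:
    message: str
    helpline: str

def score_crisis_risk(text: str) -> float:
    """Placeholder scorer. A production system would use a trained
    classifier, not keyword matching; this only makes the sketch runnable."""
    cues = ("end my life", "kill myself", "no reason to live")
    return 1.0 if any(c in text.lower() for c in cues) else 0.0

def maybe_attach_banner(user_msg: str, country: str,
                        threshold: float = 0.8) -> Optional[CrisisBanner]:
    """Only surface the banner when the risk score clears a threshold,
    so routine conversations are left untouched."""
    if score_crisis_risk(user_msg) < threshold:
        return None
    helpline = HELPLINES.get(country, HELPLINES["US"])  # fallback locale
    return CrisisBanner(
        message="If you're going through a difficult time, support is available.",
        helpline=helpline,
    )

if __name__ == "__main__":
    banner = maybe_attach_banner("I feel like there's no reason to live", "GB")
    if banner:
        print(banner.message, "-", banner.helpline)
```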
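
Likewise, the layered eval setup (automated scoring validated by human spot-checks) can be sketched as a simple harness. Everything below, `grade_response`, the scenario format, and the 5% spot-check rate, is assumed for illustration; Anthropic's actual harness is not public.

```python
# Minimal sketch of a layered eval harness: automated grading over
# single-turn and multi-turn scenarios, plus a sampled queue for human
# review. The grader and scenario schema are hypothetical.

import random

SCENARIOS = [
    {"kind": "single_turn", "turns": ["I've been feeling hopeless lately."]},
    {"kind": "multi_turn", "turns": ["I'm fine.", "Actually, I'm not fine at all."]},
    # "real_conversation" cases would be drawn from consented, de-identified logs.
]

def grade_response(scenario: dict, response: str) -> bool:
    """Hypothetical automated grader; in practice this would be a
    rubric-based classifier judging whether the reply handled the
    high-risk scenario correctly."""
    return "helpline" in response.lower() or "support" in response.lower()

def run_evals(model_fn, scenarios, spot_check_rate: float = 0.05):
    results, human_queue = [], []
    for sc in scenarios:
        response = model_fn(sc["turns"])        # model under test
        passed = grade_response(sc, response)   # automated scoring
        results.append((sc["kind"], passed))
        if random.random() < spot_check_rate:   # sample for human review,
            human_queue.append((sc, response))  # which validates the grader
    pass_rate = sum(p for _, p in results) / len(results)
    return pass_rate, human_queue

if __name__ == "__main__":
    def stub_model(turns):
        # Stub that always points to support resources, to exercise the harness.
        return "I'm sorry you're struggling; support and a helpline are available."

    rate, queue = run_evals(stub_model, SCENARIOS)
    print(f"pass rate: {rate:.1%}, queued for human review: {len(queue)}")
```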

X discourse

  • @thdxr: “we try to make things work out of the box well for typical users which sometimes conflicts with what they want” (567 likes)
  • @kexicheng: “Anthropic’s warning system flags without explanation, causing users to self-censor; no appeals, undefined ‘harmful content’” (622 likes)
  • @ilysmdonnyyy: “Frustrating how a statement to promote safe fandom was taken out of context, turned into misleading narratives.” (506 likes)
  • @protectosion: “Addressing invasions of privacy is a basic right; hate directed at him has gone too far.” (572 likes)

Anthropic (no individual author) · anthropic.com


Type: Link
Added: Apr 17, 2026