Protecting the Wellbeing of Our Users
https://www.anthropic.com/news/protecting-well-being-of-users-
Claude hits 98.6–99.3% correct response rate on suicide/self-harm scenarios.
- Measured across Opus, Sonnet, and Haiku 4.5 in high-risk situations.
Automated classifier surfaces crisis banners linking to helplines in 170+ countries.
- Partnership with ThroughLine; IASP consulted for methodology.
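The banner mechanism above can be pictured as a threshold check plus a country lookup. A minimal sketch, assuming an upstream classifier emits a risk score and a ThroughLine-style table maps country codes to helplines; the function name, threshold, and table entries are illustrative, not Anthropic's implementation.

```python
# Illustrative subset of a country -> helpline directory (hypothetical structure).
HELPLINES = {
    "US": "988 Suicide & Crisis Lifeline",
    "GB": "Samaritans",
}

def crisis_banner(risk_score: float, country_code: str, threshold: float = 0.8):
    """Return a helpline banner string when the classifier flags a crisis."""
    if risk_score < threshold:
        return None  # no banner for low-risk conversations
    # Fall back to a generic directory when no localized helpline is known.
    return HELPLINES.get(country_code, "International helpline directory")
```

The real system reportedly covers 170+ countries; the fallback line stands in for that long tail.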
Sycophancy reduced 70–85% vs prior models via new behavioral audits.
- Risk: AI validating delusions is a real harm vector, not just annoyance.
- Claude.ai bans under-18s; classifiers detect self-disclosed age or subtle cues.
Evals combine single-turn, multi-turn, and real-user-conversation stress tests.
- Human spot-checks validate automated scoring.
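The spot-check step can be sketched as sampling a fraction of automated verdicts for human review and measuring agreement. A minimal sketch; the function names, sampling fraction, and label scheme are assumptions for illustration, not the published methodology.

```python
import random

def spot_check_sample(verdicts, fraction=0.1, seed=0):
    """Pick a random subset of automated verdicts (by index) for human review."""
    rng = random.Random(seed)
    k = max(1, int(len(verdicts) * fraction))
    return rng.sample(range(len(verdicts)), k)

def agreement_rate(automated, human):
    """Fraction of spot-checked items where the human label matches the classifier."""
    matches = sum(1 for a, h in zip(automated, human) if a == h)
    return matches / len(automated)
```

Low agreement on the sampled subset would signal that the automated grader needs recalibration before its scores are trusted at scale.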
X discourse
- @thdxr: “we try to make things work out of the box well for typical users which sometimes conflicts with what they want” (567 likes)
- @kexicheng: “Anthropic’s warning system flags without explanation, causing users to self-censor; no appeals, undefined ‘harmful content’” (622 likes)
- @ilysmdonnyyy: “Frustrating how a statement to promote safe fandom was taken out of context, turned into misleading narratives.” (506 likes)
- @protectosion: “Addressing invasions of privacy is a basic right; hate directed at him has gone too far.” (572 likes)
Anthropic (no individual author) · Read on anthropic.com
| Field | Value |
| --- | --- |
| Type | Link |
| Added | Apr 17, 2026 |