Protecting the Wellbeing of Our Users
https://www.anthropic.com/news/protecting-well-being-of-users-
Claude hits 98.6–99.3% correct response rate on suicide/self-harm scenarios.
- Measured across Opus, Sonnet, and Haiku 4.5 in high-risk situations.
Automated classifier surfaces crisis banners linking to helplines in 170+ countries.
- Partnership with ThroughLine; IASP consulted for methodology.
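The banner mechanism above can be pictured as a threshold check plus a country lookup. A minimal sketch, assuming an upstream classifier emits a risk score and a ThroughLine-style table maps country codes to helplines; the function name, threshold, and table entries are illustrative, not Anthropic's implementation.

```python
# Illustrative subset of a country -> helpline directory (hypothetical structure).
HELPLINES = {
    "US": "988 Suicide & Crisis Lifeline",
    "GB": "Samaritans",
}

def crisis_banner(risk_score: float, country_code: str, threshold: float = 0.8):
    """Return a helpline banner string when the classifier flags a crisis."""
    if risk_score < threshold:
        return None  # no banner for low-risk conversations
    # Fall back to a generic directory when no localized helpline is known.
    return HELPLINES.get(country_code, "International helpline directory")
```

The real system reportedly covers 170+ countries; the fallback line stands in for that long tail.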
Sycophancy reduced 70–85% vs prior models via new behavioral audits.
- Risk: AI validating delusions is a real harm vector, not just annoyance.
- Claude.ai bans under-18s; classifiers detect self-disclosed age or subtle cues.
Evals combine single-turn, multi-turn, and real-user-conversation stress tests.
- Human spot-checks validate automated scoring.
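The spot-check step can be sketched as sampling a fraction of automated verdicts for human review and measuring agreement. A minimal sketch; the function names, sampling fraction, and label scheme are assumptions for illustration, not the published methodology.

```python
import random

def spot_check_sample(verdicts, fraction=0.1, seed=0):
    """Pick a random subset of automated verdicts (by index) for human review."""
    rng = random.Random(seed)
    k = max(1, int(len(verdicts) * fraction))
    return rng.sample(range(len(verdicts)), k)

def agreement_rate(automated, human):
    """Fraction of spot-checked items where the human label matches the classifier."""
    matches = sum(1 for a, h in zip(automated, human) if a == h)
    return matches / len(automated)
```

Low agreement on the sampled subset would signal that the automated grader needs recalibration before its scores are trusted at scale.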
X discourse
- @thdxr: “we try to make things work out of the box well for typical users which sometimes conflicts with what they want” (567 likes)
- @kexicheng: “Anthropic’s warning system flags without explanation, causing users to self-censor; no appeals, undefined ‘harmful content’” (622 likes)
- @ilysmdonnyyy: “Frustrating how a statement to promote safe fandom was taken out of context, turned into misleading narratives.” (506 likes)
- @protectosion: “Addressing invasions of privacy is a basic right; hate directed at him has gone too far.” (572 likes)
Anthropic (no individual author) · Read on anthropic.com
| Field | Value |
| --- | --- |
| Type | Link |
| Added | Apr 17, 2026 |