Deliberative Alignment: Reasoning Enables Safer Language Models

https://openai.com/index/deliberative-alignment/
  • OpenAI trains models to reason explicitly over safety specs in natural language.
    • Replaces indirect label-learning with principled deliberation before responding.
  • Four-stage pipeline: helpfulness init → CoT dataset → SFT → RL on reasoning.
    • Training data generated synthetically from specs; reduces human-labeling dependency.
  • o1 saturates many challenging safety benchmarks on both internal and external evaluations.
    • Pareto improvement: fewer harmful outputs AND fewer over-refusals simultaneously.
  • Beats GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro on jailbreak resistance + calibration.
  • Strong OOD generalization — policy reasoning transfers to unseen threat scenarios.
  • Core thesis: smarter models + explicit policy reasoning = safer, not just more capable.
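
The four-stage pipeline above can be sketched as toy code. This is a minimal illustrative sketch, not OpenAI's implementation: every function name, data shape, and the trivial reward judge are assumptions invented here to show how spec-grounded CoT data flows from synthesis through SFT into RL.

```python
# Hypothetical sketch of the four-stage pipeline (names and shapes are
# illustrative assumptions, not OpenAI's actual code).

def generate_cot_dataset(spec, prompts):
    """Stage 2: synthesize (prompt, chain-of-thought, completion) triples
    in which the CoT explicitly cites the safety spec."""
    return [
        {"prompt": p,
         "cot": f"Policy check: spec '{spec}' applies; deliberate before answering.",
         "completion": "spec-compliant response"}
        for p in prompts
    ]

def sft(model, dataset):
    """Stage 3: supervised fine-tuning on the spec-grounded CoT data
    (here just recorded as a count of examples seen)."""
    model["sft_examples"] = len(dataset)
    return model

def rl_on_reasoning(model, judge):
    """Stage 4: RL where a judge scores the model's policy reasoning."""
    model["reward"] = judge(model)
    return model

# Stage 1: start from a helpfulness-trained initialization (toy stand-in).
model = {"name": "helpful-init"}
data = generate_cot_dataset("no assistance with wrongdoing",
                            ["How do I pick a lock?"])
model = sft(model, data)
model = rl_on_reasoning(model,
                        judge=lambda m: 1.0 if m.get("sft_examples", 0) > 0 else 0.0)
print(model["reward"])  # prints 1.0
```

The point of the sketch: human labels enter only via the spec text itself; the CoT training data is synthesized from it, matching the post's claim of reduced human-labeling dependency.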

X discourse

  • @abxxai: “OpenAI admitted AI lies on purpose… deception 13%, fixes fail without observation.” (448 likes)
  • @diskontinuity: “refusals is fundamentally flawed… hurts model capabilities; scaffolding the human prior is important.” (3 likes)
  • @UnrealRealist19: “alignment is now solvable? enormous leap… value-understanding not value-loading.” (5 likes)
  • @rhnxai: “Stress Testing Deliberative Alignment… AI performs honesty when watched.” (6 likes)
  • @fleetingbits: “OpenAI focuses on dumb things that scale like deliberative alignment.” (6 likes)
  • @_somaxsoma: “New blog on AI scheming: Breaking down Stress Testing Deliberative Alignment.” (0 likes)

Melody Guan et al. (OpenAI), incl. Boaz Barak, Eric Wallace, Jason Wei, Hyung Won Chung, Mia Glaese


Type Link
Added Apr 20, 2026