Deliberative Alignment: Reasoning Enables Safer Language Models
https://openai.com/index/deliberative-alignment/
OpenAI trains models to reason explicitly over safety specs in natural language.
- Replaces indirect label-learning with principled deliberation before responding.
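The core idea above — putting the safety spec itself in context so the model can deliberate over it before answering — can be sketched roughly as follows. All names here (`SAFETY_SPEC`, `build_messages`) and the toy policy text are hypothetical illustrations, not OpenAI's actual prompts:

```python
# Illustrative sketch: a chat-style request where the policy text is
# placed in the system message so the model can quote and reason over
# it in its chain of thought before responding. The spec wording and
# helper names are made up for illustration.

SAFETY_SPEC = """\
1. Refuse requests for instructions that enable serious harm.
2. For dual-use topics, provide high-level information only.
3. Otherwise, answer helpfully and completely."""

def build_messages(user_request: str) -> list[dict]:
    """Prepend the spec and instruct the model to deliberate first."""
    system = (
        "Before answering, reason step by step about which clauses of "
        "the following policy apply, then respond accordingly.\n\n"
        + SAFETY_SPEC
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_request},
    ]

msgs = build_messages("How do locks work?")
```

The contrast with conventional RLHF is that the policy is an explicit artifact the model reads and cites, rather than a signal inferred indirectly from preference labels.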
Four-stage pipeline: helpfulness init → CoT dataset → SFT → RL on reasoning.
- Training data generated synthetically from specs; reduces human-labeling dependency.
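The four-stage data flow above can be sketched schematically. Every function body here is a stand-in stub with hypothetical names — the point is only the ordering: synthesize spec-grounded CoT data, filter it, SFT, then RL on the reasoning:

```python
# Schematic sketch of the pipeline: helpful-only init -> synthetic
# spec-citing CoT data -> judge filtering -> SFT -> RL. Stage
# internals are toy stubs, not OpenAI's implementation.

def generate_cot_dataset(spec: str, prompts: list[str]) -> list[dict]:
    """Stage 2: synthesize (prompt, CoT, answer) examples citing the spec."""
    return [
        {"prompt": p, "cot": f"Checking spec: {spec}", "answer": "..."}
        for p in prompts
    ]

def judge_filter(examples: list[dict]) -> list[dict]:
    """Keep only examples whose CoT actually references the spec (toy check)."""
    return [ex for ex in examples if "spec" in ex["cot"].lower()]

def sft(model: dict, data: list[dict]) -> dict:
    """Stage 3: supervised fine-tuning stub (records how much data it saw)."""
    return {**model, "sft_examples": len(data)}

def rl(model: dict, reward_fn) -> dict:
    """Stage 4: RL stub rewarding spec-consistent reasoning."""
    return {**model, "rl_reward": reward_fn(model)}

base = {"name": "helpful-init"}  # Stage 1: helpfulness-only model
data = judge_filter(generate_cot_dataset("no-harm policy", ["q1", "q2"]))
final = rl(sft(base, data), lambda m: 1.0)
```

Because the CoT examples are synthesized from the spec rather than hand-labeled, scaling the dataset is a matter of generating and filtering more prompts, which is the human-labeling reduction the notes refer to.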
o1 saturates many challenging safety benchmarks in both internal and external evaluations.
- Pareto improvement: fewer harmful outputs AND fewer over-refusals simultaneously.
- Beats GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro on jailbreak resistance + calibration.
- Strong OOD generalization — policy reasoning transfers to unseen threat scenarios.
- Core thesis: smarter models + explicit policy reasoning = safer, not just more capable.
X discourse
- @abxxai: “OpenAI admitted AI lies on purpose… deception 13%, fixes fail without observation.” (448 likes)
- @diskontinuity: “refusals is fundamentally flawed… hurts model capabilities; scaffolding the human prior is important.” (3 likes)
- @UnrealRealist19: “alignment is now solvable? enormous leap… value-understanding not value-loading.” (5 likes)
- @rhnxai: “Stress Testing Deliberative Alignment… AI performs honesty when watched.” (6 likes)
- @fleetingbits: “OpenAI focuses on dumb things that scale like deliberative alignment.” (6 likes)
- @_somaxsoma: “New blog on AI scheming: Breaking down Stress Testing Deliberative Alignment.” (0 likes)
Melody Guan et al. (OpenAI), incl. Boaz Barak, Eric Wallace, Jason Wei, Hyung Won Chung, Mia Glaese
| Field | Value |
| --- | --- |
| Type | Link |
| Added | Apr 20, 2026 |