Deliberative Alignment: Reasoning Enables Safer Language Models

https://openai.com/index/deliberative-alignment/
  • OpenAI trains models to reason explicitly over safety specs in natural language.
    • Replaces indirect label-learning with principled deliberation before responding.
  • Four-stage pipeline: helpfulness init → CoT dataset → SFT → RL on reasoning.
    • Training data generated synthetically from specs; reduces human-labeling dependency.
  • o1 saturates many challenging safety benchmarks on both internal and external evaluations.
    • Pareto improvement: fewer harmful outputs AND fewer over-refusals simultaneously.
  • Beats GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro on jailbreak resistance + calibration.
  • Strong OOD generalization — policy reasoning transfers to unseen threat scenarios.
  • Core thesis: smarter models + explicit policy reasoning = safer, not just more capable.
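
The four-stage pipeline above can be sketched as toy code. This is a minimal illustrative sketch, not OpenAI's implementation: every function name, data shape, and the trivial reward judge are assumptions invented here to show how spec-grounded CoT data flows from synthesis through SFT into RL.

```python
# Hypothetical sketch of the four-stage pipeline (names and shapes are
# illustrative assumptions, not OpenAI's actual code).

def generate_cot_dataset(spec, prompts):
    """Stage 2: synthesize (prompt, chain-of-thought, completion) triples
    in which the CoT explicitly cites the safety spec."""
    return [
        {"prompt": p,
         "cot": f"Policy check: spec '{spec}' applies; deliberate before answering.",
         "completion": "spec-compliant response"}
        for p in prompts
    ]

def sft(model, dataset):
    """Stage 3: supervised fine-tuning on the spec-grounded CoT data
    (here just recorded as a count of examples seen)."""
    model["sft_examples"] = len(dataset)
    return model

def rl_on_reasoning(model, judge):
    """Stage 4: RL where a judge scores the model's policy reasoning."""
    model["reward"] = judge(model)
    return model

# Stage 1: start from a helpfulness-trained initialization (toy stand-in).
model = {"name": "helpful-init"}
data = generate_cot_dataset("no assistance with wrongdoing",
                            ["How do I pick a lock?"])
model = sft(model, data)
model = rl_on_reasoning(model,
                        judge=lambda m: 1.0 if m.get("sft_examples", 0) > 0 else 0.0)
print(model["reward"])  # prints 1.0
```

The point of the sketch: human labels enter only via the spec text itself; the CoT training data is synthesized from it, matching the post's claim of reduced human-labeling dependency.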

X discourse

  • @abxxai: “OpenAI admitted AI lies on purpose… deception 13%, fixes fail without observation.” (448 likes)
  • @diskontinuity: “refusals is fundamentally flawed… hurts model capabilities; scaffolding the human prior is important.” (3 likes)
  • @UnrealRealist19: “alignment is now solvable? enormous leap… value-understanding not value-loading.” (5 likes)
  • @rhnxai: “Stress Testing Deliberative Alignment… AI performs honesty when watched.” (6 likes)
  • @fleetingbits: “OpenAI focuses on dumb things that scale like deliberative alignment.” (6 likes)
  • @_somaxsoma: “New blog on AI scheming: Breaking down Stress Testing Deliberative Alignment.” (0 likes)

Melody Guan et al. (OpenAI), incl. Boaz Barak, Eric Wallace, Jason Wei, Hyung Won Chung, Mia Glaese


Type Link
Added Apr 20, 2026