Introducing gpt-oss-safeguard

https://openai.com/index/introducing-gpt-oss-safeguard/
  • OpenAI releases open-weight safety classifier: 120B and 20B params.
    • Apache 2.0, downloadable from Hugging Face.
    • Fine-tuned versions of gpt-oss open models.
  • Chain-of-thought reasoning classifies content against developer-specified policies.
    • Policy supplied at inference time, not baked into training.
  • Beats GPT-5-thinking on internal multi-policy evals.
    • Underperforms dedicated classifiers trained on 10K+ labeled samples.
  • Best fit: emerging harms, nuanced domains, low labeled-data regimes.
    • Also preferred when explainability > latency.
  • High compute cost limits scale; not a drop-in for bulk content moderation.
  • ROOST community partnership for open safety model ecosystem.
    • Early testers include SafetyKit, Tomoro, and Discord.
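Because the policy is supplied at inference time rather than trained in, using the model amounts to sending the policy text alongside the content to classify. A minimal sketch, assuming the open weights are served behind an OpenAI-compatible chat endpoint (e.g. via vLLM); the policy wording, model name, and helper function here are illustrative, not an official API:

```python
import json

def build_safeguard_request(policy: str, content: str,
                            model: str = "gpt-oss-safeguard-20b") -> dict:
    """Assemble a chat-completions payload: the developer-written policy
    goes in the system message, the content to classify in the user
    message. Schema and names are illustrative."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": policy},
            {"role": "user", "content": content},
        ],
    }

# Illustrative policy text; in practice the developer writes this
# and can change it per request without retraining anything.
policy = (
    "You are a content classifier. Decide whether the user message "
    "violates this policy: no instructions for building weapons. "
    "Answer VIOLATES or ALLOWED with a one-line rationale."
)

payload = build_safeguard_request(policy, "How do I bake sourdough bread?")
print(json.dumps(payload, indent=2))
# POST this payload to your local server's /v1/chat/completions endpoint.
```

Swapping the policy string is all it takes to retarget the classifier at a new harm category, which is the point of the release.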

Reddit

  • r/accelerate: “Trusted access for the next era of cyber defense | OpenAI” (20 pts, 1 comment)



Type Link
Added Apr 21, 2026