Introducing gpt-oss-safeguard

https://openai.com/index/introducing-gpt-oss-safeguard/
  • OpenAI releases open-weight safety classifier: 120B and 20B params.
    • Apache 2.0, downloadable from Hugging Face.
    • Fine-tuned versions of gpt-oss open models.
  • Chain-of-thought reasoning classifies content against developer-specified policies.
    • Policy supplied at inference time, not baked into training.
  • Beats GPT-5-thinking on internal multi-policy evals.
    • Underperforms dedicated classifiers trained on 10K+ labeled samples.
  • Best fit: emerging harms, nuanced domains, low labeled-data regimes.
    • Also preferred when explainability > latency.
  • High compute cost limits scale; not a drop-in for bulk content moderation.
  • ROOST community partnership for open safety model ecosystem.
    • Early testers include SafetyKit, Tomoro, and Discord.
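Because the policy is supplied at inference time rather than trained in, using the model amounts to sending the policy text alongside the content to classify. A minimal sketch, assuming the open weights are served behind an OpenAI-compatible chat endpoint (e.g. via vLLM); the policy wording, model name, and helper function here are illustrative, not an official API:

```python
import json

def build_safeguard_request(policy: str, content: str,
                            model: str = "gpt-oss-safeguard-20b") -> dict:
    """Assemble a chat-completions payload: the developer-written policy
    goes in the system message, the content to classify in the user
    message. Schema and names are illustrative."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": policy},
            {"role": "user", "content": content},
        ],
    }

# Illustrative policy text; in practice the developer writes this
# and can change it per request without retraining anything.
policy = (
    "You are a content classifier. Decide whether the user message "
    "violates this policy: no instructions for building weapons. "
    "Answer VIOLATES or ALLOWED with a one-line rationale."
)

payload = build_safeguard_request(policy, "How do I bake sourdough bread?")
print(json.dumps(payload, indent=2))
# POST this payload to your local server's /v1/chat/completions endpoint.
```

Swapping the policy string is all it takes to retarget the classifier at a new harm category, which is the point of the release.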

Reddit

  • r/accelerate: “Trusted access for the next era of cyber defense | OpenAI” (20 pts, 1 comment)



Type Link
Added Apr 21, 2026