gpt-oss-safeguard technical report

https://openai.com/index/gpt-oss-safeguard-technical-report/
  • Two open-weight safety models: 120B and 20B params, Apache 2.0.
    • Post-trained from gpt-oss to classify content against a developer-provided policy (minimal usage sketch after this list).
    • Designed as classifiers, not end-user-facing models.
  • Chain-of-thought reasoning replaces static keyword filters.
    • Generates reviewable audit trails explaining each classification decision.
  • Outperforms gpt-5-thinking on multi-policy accuracy despite smaller size.
  • Teen Safety Policy Pack: 6 risk categories, prompt-based, open source.
    • Graphic violence, sexual content, harmful body ideals, dangerous challenges, roleplay, and age-restricted goods.
    • Prompts work with other models, not just gpt-oss-safeguard.
  • Core problem: developers struggle to translate safety goals into precise operational rules.
    • Gaps cause inconsistent enforcement or over-broad filtering.
  • Up to 16% of compute allocated to safety reasoning in some product launches.
  • Validated pre-release with ROOST, SafetyKit, Tomoro, and Discord, ahead of the Oct 2025 launch.
    • Developed with Common Sense Media and everyone.ai; released via ROOST Model Community.
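
The bullets above describe a concrete usage pattern: the policy is just prompt text, and the model returns a reasoned classification rather than a bare label. A minimal sketch follows, assuming gpt-oss-safeguard is served behind an OpenAI-compatible endpoint (e.g. via vLLM) and takes the policy as the system message; the endpoint URL, model name, and policy filename are placeholders, not from the report.

```python
"""Minimal sketch: gpt-oss-safeguard as a prompt-based policy classifier.

Assumptions (not from the report): the model is served behind an
OpenAI-compatible endpoint, the full policy text goes in the system
message and the content to classify in the user message, and the URL,
model name, and policy path below are placeholders.
"""
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Any prompt-based policy works here, e.g. one prompt from the Teen
# Safety Policy Pack (hypothetical filename).
with open("policies/dangerous_challenges.md") as f:
    policy = f.read()


def classify(content: str) -> str:
    """Classify one piece of content against the loaded policy."""
    resp = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",
        messages=[
            {"role": "system", "content": policy},
            {"role": "user", "content": content},
        ],
    )
    # The reply carries the model's rationale alongside the decision,
    # which is what makes each call reviewable as an audit trail.
    return resp.choices[0].message.content


print(classify("Example user post to moderate goes here."))
```

Because the policy pack is just prompts, the same policy string can be sent to any other chat model for comparison.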

X discourse

  • @OpenAIDevs: “We’re releasing prompt-based teen safety policies for gpt-oss-safeguard. They’re designed to help you identify and moder…” (464 likes)
  • @heynavtoor: “Researchers proved major AI safety systems fake: GPT-4o 0% to 93% unsafe, Claude 2.4% to 93%, by rephrasing dangerous re…” (988 likes)
  • @sharbel: “Researchers discovered AI safety guardrails bypassed by single hidden vector in model’s brain, reversing refusal via OV…” (380 likes)
  • @LyptusResearch: “Offensive cyber capability doubling every 9.8 months; Opus 4.6 and GPT-5.3 Codex above trendlines on expert tasks.” (234 likes)
  • @AISecHub: “OWASP GenAI Data Security Risks & Mitigations 2026 guide: framework for securing GenAI systems, focusing on data layer.” (208 likes)
  • @gabor_rar: “Paper solid, but framing flips it. Interpretability maps refusal for defenses, not attacks. Production safety is RLHF pl…” (5 likes)

OpenAI Safety Team


Type: Link
Added: Apr 16, 2026