gpt-oss-safeguard technical report
https://openai.com/index/gpt-oss-safeguard-technical-report/-
Two open-weight safety models: 120B and 20B params, Apache 2.0.
- Post-trained from gpt-oss to classify content against a provided policy.
- Designed as classifiers, not end-user-facing models.
-
Chain-of-thought reasoning replaces static keyword filters.
- Generates reviewable audit trails explaining each classification decision.
- Outperforms gpt-5-thinking on multi-policy accuracy despite smaller size.
-
Teen Safety Policy Pack: 6 risk categories, prompt-based, open source.
- Graphic violence/sexual content, harmful body ideals, dangerous challenges, roleplay, age-restricted goods.
- Prompts work with other models, not just gpt-oss-safeguard.
-
Core problem: developers struggle translating safety goals into precise operational rules.
- Gaps cause inconsistent enforcement or over-broad filtering.
- Up to 16% of compute allocated to safety reasoning in some product launches.
-
Pre-release validated with ROOST, SafetyKit, Tomoro, and Discord (before Oct 2025).
- Developed with Common Sense Media and everyone.ai; released via ROOST Model Community.
X discourse
- @OpenAIDevs: “We’re releasing prompt-based teen safety policies for gpt-oss-safeguard. They’re designed to help you identify and moder” (464 likes)
- @heynavtoor: “Researchers proved major AI safety systems fake: GPT-4o 0% to 93% unsafe, Claude 2.4% to 93%, by rephrasing dangerous re” (988 likes)
- @sharbel: “Researchers discovered AI safety guardrails bypassed by single hidden vector in model’s brain, reversing refusal via OV “ (380 likes)
- @LyptusResearch: “Offensive cyber capability doubling every 9.8 months; Opus 4.6 and GPT-5.3 Codex above trendlines on expert tasks.” (234 likes)
- @AISecHub: “OWASP GenAI Data Security Risks & Mitigations 2026 guide: framework for securing GenAI systems, focusing on data layer.” (208 likes)
- @gabor_rar: “Paper solid, but framing flips it. Interpretability maps refusal for defenses, not attacks. Production safety is RLHF pl” (5 likes)
OpenAI Safety Team · ** · Read on openai.com
| Type | Link |
| Added | Apr 16, 2026 |