gpt-oss-safeguard technical report

https://openai.com/index/gpt-oss-safeguard-technical-report/
  • Two open-weight safety models: 120B and 20B params, Apache 2.0.
    • Post-trained from gpt-oss to classify content against a developer-provided policy (minimal usage sketch after this list).
    • Designed as classifiers, not end-user-facing models.
  • Chain-of-thought reasoning replaces static keyword filters.
    • Generates reviewable audit trails explaining each classification decision.
  • Outperforms gpt-5-thinking on multi-policy accuracy despite smaller size.
  • Teen Safety Policy Pack: 6 risk categories, prompt-based, open source.
    • Graphic violence, sexual content, harmful body ideals, dangerous challenges, roleplay, and age-restricted goods.
    • Prompts work with other models, not just gpt-oss-safeguard.
  • Core problem: developers struggle to translate safety goals into precise operational rules.
    • Gaps cause inconsistent enforcement or over-broad filtering.
  • Up to 16% of compute allocated to safety reasoning in some product launches.
  • Validated pre-release with ROOST, SafetyKit, Tomoro, and Discord, ahead of the Oct 2025 launch.
    • Developed with Common Sense Media and everyone.ai; released via ROOST Model Community.
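
The bullets above describe a concrete usage pattern: the policy is just prompt text, and the model returns a reasoned classification rather than a bare label. A minimal sketch follows, assuming gpt-oss-safeguard is served behind an OpenAI-compatible endpoint (e.g. via vLLM) and takes the policy as the system message; the endpoint URL, model name, and policy filename are placeholders, not from the report.

```python
"""Minimal sketch: gpt-oss-safeguard as a prompt-based policy classifier.

Assumptions (not from the report): the model is served behind an
OpenAI-compatible endpoint, the full policy text goes in the system
message and the content to classify in the user message, and the URL,
model name, and policy path below are placeholders.
"""
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Any prompt-based policy works here, e.g. one prompt from the Teen
# Safety Policy Pack (hypothetical filename).
with open("policies/dangerous_challenges.md") as f:
    policy = f.read()


def classify(content: str) -> str:
    """Classify one piece of content against the loaded policy."""
    resp = client.chat.completions.create(
        model="gpt-oss-safeguard-20b",
        messages=[
            {"role": "system", "content": policy},
            {"role": "user", "content": content},
        ],
    )
    # The reply carries the model's rationale alongside the decision,
    # which is what makes each call reviewable as an audit trail.
    return resp.choices[0].message.content


print(classify("Example user post to moderate goes here."))
```

Because the policy pack is just prompts, the same policy string can be sent to any other chat model for comparison.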

X discourse

  • @OpenAIDevs: “We’re releasing prompt-based teen safety policies for gpt-oss-safeguard. They’re designed to help you identify and moder…” (464 likes)
  • @heynavtoor: “Researchers proved major AI safety systems fake: GPT-4o 0% to 93% unsafe, Claude 2.4% to 93%, by rephrasing dangerous re…” (988 likes)
  • @sharbel: “Researchers discovered AI safety guardrails bypassed by single hidden vector in model’s brain, reversing refusal via OV…” (380 likes)
  • @LyptusResearch: “Offensive cyber capability doubling every 9.8 months; Opus 4.6 and GPT-5.3 Codex above trendlines on expert tasks.” (234 likes)
  • @AISecHub: “OWASP GenAI Data Security Risks & Mitigations 2026 guide: framework for securing GenAI systems, focusing on data layer.” (208 likes)
  • @gabor_rar: “Paper solid, but framing flips it. Interpretability maps refusal for defenses, not attacks. Production safety is RLHF pl…” (5 likes)

OpenAI Safety Team


Type: Link
Added: Apr 16, 2026