XBOW tested GPT-5.5 under early access across its offensive security benchmarks and found it reaches Mythos-level hacking performance; the model is now publicly available.
Key Takeaways
XBOW is a security-focused AI company that had early access to GPT-5.5 and ran it through its full offensive security benchmark suite and operational workflows.
“Mythos-like” performance is the headline claim: GPT-5.5 reportedly matches the performance of a top-tier hacking agent on offensive benchmarks.
Benchmark coverage includes web vulnerability discovery in open source software (OSS), a concrete and reproducible attack surface for comparison.
The post includes multi-model comparison plots, with models like Claude Opus 4.7 appearing as reference points alongside GPT-5.5.
Public availability is the key deployment shift: this is not a research preview but a production model accessible to all users.
Hacker News Comment Review
The single substantive comment raises a sharp methodological critique: the benchmark visualizations use line charts for categorical model comparisons, a chart type that implies continuity between discrete categories and misleads readers.
A specific flaw flagged: the “Web Vulns in OSS” plot has no white-box data for Opus 4.7, but the connecting line visually implies a value near 60, potentially inflating perceived GPT-5.5 gains by distorting the baseline.
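The critique is easy to see in code: a line chart interpolates across a missing category, while a bar chart simply leaves a gap. The sketch below is illustrative only; the scores are hypothetical placeholders, not XBOW's actual benchmark numbers.

```python
# Illustrative sketch: categorical model comparisons belong on bar charts,
# where a missing measurement shows as an absent bar rather than a line
# interpolated through it. All scores here are hypothetical placeholders.
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

models = ["GPT-5.5", "Claude Opus 4.7", "Mythos"]
blackbox = [72, 65, 70]        # hypothetical black-box scores
whitebox = [80, None, 78]      # Opus 4.7 white-box score is missing

x = np.arange(len(models))
width = 0.35

fig, ax = plt.subplots()
ax.bar(x - width / 2, blackbox, width, label="black-box")

# Plot only the white-box scores that exist; the gap stays visible
# instead of being bridged by a connecting line.
wb_x = [xi + width / 2 for xi, s in zip(x, whitebox) if s is not None]
wb_y = [s for s in whitebox if s is not None]
ax.bar(wb_x, wb_y, width, label="white-box")

ax.set_xticks(x)
ax.set_xticklabels(models)
ax.set_ylabel("benchmark score (hypothetical)")
ax.legend()
fig.savefig("comparison.png")
```

With a line chart, matplotlib (or any plotting library) would draw a segment through the missing Opus 4.7 point, manufacturing exactly the "value near 60" the commenter objects to; the bar form makes the absence of data explicit.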
Notable Comments
@nsingh2: “the absurd connecting line implies it should be near 60” – calls out how line chart interpolation visually fabricates a value for the missing Opus 4.7 white-box data.