Introducing GPT-5.5

coding · ai · ai-agents

TLDR

  • GPT-5.5 hits SOTA on Terminal-Bench 2.0 (82.7%), SWE-Bench Pro (58.6%), and OSWorld-Verified (78.7%) while matching GPT-5.4 per-token latency.

Key Takeaways

  • Matches GPT-5.4 per-token latency despite higher capability; uses fewer tokens on Codex tasks, cutting cost alongside intelligence gains.
  • Sets SOTA on Terminal-Bench 2.0 at 82.7% and scores 58.6% on SWE-Bench Pro; hits 73.1% on Expert-SWE, an internal suite of long-horizon tasks with a median estimated human completion time of 20 hours.
  • Available today to ChatGPT Plus, Pro, Business, Enterprise, and Codex users; API access is coming soon, with additional safety requirements for use at scale.
  • GPT-5.5 Pro targets demanding knowledge work: GDPval 84.9%, OSWorld-Verified 78.7%, Tau2-bench Telecom 98.0% without prompt tuning.
  • Scientific capability gains: leading scores on GeneBench and BixBench, and a custom GPT-5.5 harness helped produce a new proof about Ramsey numbers.

Hacker News Comment Review

  • Skeptics noted the release landed shortly after Claude Opus 4.7, with benchmark selection that happens to favor GPT-5.5; “our smartest model yet” framing drew predictable eye-rolls.
  • The detail that Codex analyzed its own production traffic and wrote custom heuristics to boost GPU token throughput by 20% attracted more genuine technical interest than the benchmark table.
  • Practitioners are cautiously optimistic on token efficiency: Opus 4.7’s gains came by using more tokens, not fewer, so GPT-5.5’s opposite trajectory stands out if it holds outside evals.

Notable Comments

  • @tedsanders: the rollout is gradual over many hours, reaching Pro and Enterprise accounts first and then Plus, so access may not be live for everyone on launch day.
  • @astlouis44: shared a playable 3D dungeon arena that Codex built from a single prompt using TypeScript and Three.js, with GPT-generated environment textures; one of the more concrete capability demos.
