This model is kind of a disaster.


Summary based on the YouTube transcript and episode description.

Theo (t3.gg) tests Opus 4.7 for a full day and concludes that its regressions stem from Claude Code’s degraded harness, not from the model itself.

  • Opus 4.7 scores worse than Opus 4.6 on the Agentic Search benchmark and slightly worse on cybersecurity vulnerability reproduction.
  • Cyber safeguards are so aggressive that the model hard-locked a DEF CON cryptography-puzzle chat, refusing to continue unless the session was downgraded to Sonnet 4.
  • A malware-prevention system prompt leaked into normal Claude Code Desktop sessions, flagging Theo’s own personal site as malware.
  • Vision input limit raised to 2576px long edge (~4MP), roughly 3x previous Claude models.
  • Opus 4.7 never searched for the latest package versions, repeatedly planning Next.js 15 upgrades even though Next.js 16 was available; GPT-5.4 fetched live docs and correctly targeted Next.js 16.
  • Theo’s core thesis: Anthropic engineers use a different internal stack than public Claude Code, so they ship models that perform well internally but land in a broken harness externally.
  • New Claude Code features shipped: an X-high effort level, an /ultrareview command, and an auto-mode permission classifier; auto mode, however, broke the existing bypass-permissions flag during Theo’s testing.
  • Pricing unchanged from Opus 4.6: $5 per million input tokens, $25 per million output tokens.

2026-04-17 · Watch on YouTube