Did Claude really get dumber again?
Theo (t3.gg) argues Claude’s regressions are measurable and stem from bad Claude Code engineering, a botched tokenizer change, and forced 1M context routing — not just user perception.
- Opus performs 15% worse inside Claude Code than inside Cursor on the same benchmark — harness engineering is the primary culprit, not the model weights.
- Claude Code scores 58% on Terminal Bench; competing harnesses Forge Code and Cappy score 75–82% using the same Anthropic models.
- Opus 4.7’s new tokenizer inflates token counts 1.35–1.47x on real codebases, bloating context and accelerating context rot.
- Anthropic’s own September postmortem confirmed that routing requests to the 1M-context model version degrades quality; that version is now the forced default for all Claude Code users.
- AMD’s AI director analyzed 6,800 Claude Code sessions: thinking depth fell 73%, read-to-edit ratio collapsed from 6.6:1 to 2:1, and API requests grew 80x with the same number of human prompts.
- Thinking redaction went from 0% to 100% between Jan 30 and Mar 12, 2025; quality regression reports peaked precisely when redaction crossed 50% on Mar 8.
- OpenAI's Codex shows no comparable sustained regression pattern, per broad community polling and a public statement from Tibo (Cursor) that they do not adjust thinking budgets post-release.
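
The 1.35–1.47x inflation figure above is just a ratio of token counts for identical input under two tokenizer versions. A minimal sketch of how such a factor could be measured across a codebase, using made-up per-file counts in place of real tokenizer calls (the function name and numbers here are illustrative, not from the video):

```python
# Sketch: aggregate tokenizer inflation across a set of files.
# counts_old / counts_new stand in for per-file token counts
# produced by the old and new tokenizer versions respectively.

def inflation_factor(counts_old, counts_new):
    """Aggregate inflation = total new tokens / total old tokens."""
    return sum(counts_new) / sum(counts_old)

# Hypothetical per-file token counts for the same three files:
old = [1000, 2400, 800]
new = [1400, 3300, 1150]
print(round(inflation_factor(old, new), 2))  # 1.39
```

An aggregate ratio like this understates the practical impact: a flat ~1.4x inflation means the same codebase consumes the context window ~40% faster, which is why the bullet links it to accelerated context rot.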
2026-04-20 · Watch on YouTube