I Built TetrisBench, Where LLMs Compete at Playing Tetris. Here's What I Found.

https://a16z.com/i-built-tetrisbench-where-llms-compete-at-playing-tetris-heres-what-i-found/
  • Direct board-state reasoning failed — models made nonsensical moves.
    • JSON-encoded board input → inconsistent, incoherent decisions across all tested frontier models.
  • Fix: make LLMs write scoring functions, not pick moves directly.
    • The code-generation framing stayed stable where the direct-move framing collapsed.
  • Gemini 3 Pro led at 62% win rate, 109 pts/move; Flash close at 60.3%.
    • Fewer strategy updates per game correlated with higher performance.
  • Optimization horizon emerges from behavior — prompting can’t reliably elicit it.
  • Top human (TAFOKINTS) beat Claude Opus: 22,300 vs 15,700 points.
    • Exploited “controlled chaos”: boards with bumpiness in the 12–19 range while keeping holes minimal.
    • Models broke on these board states, which sat outside their training distribution.
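The "write a scoring function, don't pick moves" framing can be sketched in a few lines. This is an illustrative mock-up, not TetrisBench's actual harness: the board encoding, features (aggregate height, holes, bumpiness), and weights are assumptions standing in for whatever heuristic a model would actually generate.

```python
# Sketch: an LLM-authored scoring function ranks candidate board states,
# and the harness picks the highest-scoring placement. Features and
# weights here are illustrative, not the benchmark's real ones.
from typing import List

Board = List[List[int]]  # rows of 0/1 cells; row 0 is the top of the well


def column_heights(board: Board) -> List[int]:
    """Height of the topmost filled cell in each column."""
    rows = len(board)
    heights = []
    for c in range(len(board[0])):
        h = 0
        for r in range(rows):
            if board[r][c]:
                h = rows - r
                break
        heights.append(h)
    return heights


def count_holes(board: Board) -> int:
    """Empty cells with at least one filled cell above them."""
    holes = 0
    for c in range(len(board[0])):
        seen_block = False
        for r in range(len(board)):
            if board[r][c]:
                seen_block = True
            elif seen_block:
                holes += 1
    return holes


def score_board(board: Board) -> float:
    """The model-generated part: lower stacks, fewer holes, and a
    flatter surface score higher (weights are made up for illustration)."""
    heights = column_heights(board)
    bumpiness = sum(abs(a - b) for a, b in zip(heights, heights[1:]))
    return -0.5 * sum(heights) - 1.0 * count_holes(board) - 0.2 * bumpiness


def pick_move(candidates: List[Board]) -> int:
    """Harness side: return the index of the best-scoring candidate board."""
    return max(range(len(candidates)), key=lambda i: score_board(candidates[i]))
```

The key design point from the post: the LLM runs once per strategy update to emit `score_board`, while the cheap, deterministic `pick_move` loop runs every piece, so a nonsensical one-off move can't slip through the way it does when the model picks moves directly.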

Yoko Li (a16z Partner, developer tools & AI infrastructure) · 2026-02-23 · Read on a16z.com


Type Link
Added Feb 23, 2026
Modified Apr 15, 2026