I Built TetrisBench, Where LLMs Compete at Playing Tetris. Here's What I Found.
https://a16z.com/i-built-tetrisbench-where-llms-compete-at-playing-tetris-heres-what-i-found/
Direct board-state reasoning failed — models made nonsensical moves.
- JSON-encoded board input → inconsistent, incoherent decisions across all tested frontier models.
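One way a board state might be JSON-encoded for a prompt (a sketch; the post doesn't show TetrisBench's exact schema, so the field names here are assumptions):

```python
import json

# Hypothetical 20x10 board: 0 = empty cell, 1 = filled cell.
board = [[0] * 10 for _ in range(20)]
board[19] = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]  # bottom row with one gap

# "current_piece" / "next_piece" are illustrative field names,
# not TetrisBench's actual schema.
payload = json.dumps({"board": board, "current_piece": "T", "next_piece": "L"})
```

Even with a clean encoding like this, the post's finding is that frontier models reasoned inconsistently over it move-to-move.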
Fix: make LLMs write scoring functions, not pick moves directly.
- Code-generation framing stable where direct-move framing collapsed.
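A scoring function in this framing is typically a hand-rolled board heuristic over height, holes, and bumpiness. A minimal sketch of the kind of evaluator a model might emit; the weights and signature are illustrative, not TetrisBench's:

```python
def score_board(board):
    """Heuristic evaluator for a board given as rows of 0/1 cells.
    Higher is better. Weights are illustrative."""
    heights = []
    holes = 0
    width = len(board[0])
    for col in range(width):
        column = [row[col] for row in board]
        try:
            top = column.index(1)
            # Height = cells from the topmost filled cell down to the floor.
            heights.append(len(column) - top)
            # Holes = empty cells trapped below the topmost filled cell.
            holes += column[top:].count(0)
        except ValueError:
            heights.append(0)  # empty column
    bumpiness = sum(abs(a - b) for a, b in zip(heights, heights[1:]))
    return -0.5 * sum(heights) - 1.0 * holes - 0.2 * bumpiness

# The harness applies the function to each candidate placement and picks
# the argmax, instead of asking the model for a move directly.
```

The stability gain plausibly comes from the function being fixed for many moves: the model commits to a policy once instead of re-deciding from scratch each turn.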
Gemini 3 Pro led at 62% win rate, 109 pts/move; Flash close at 60.3%.
- Fewer strategy updates per game correlated with higher performance.
- Optimization horizon emerges from behavior — prompting can’t reliably elicit it.
Top human (TAFOKINTS) beat Claude Opus: 22,300 vs 15,700 points.
- Exploited "controlled chaos": bumpy boards (bumpiness 12–19) with minimal holes.
- Models broke on board states outside their training distribution.
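"Controlled chaos" is quantifiable with the standard bumpiness metric (sum of adjacent column-height differences), assuming the post uses the usual definition; the heights below are illustrative, not taken from the match:

```python
def bumpiness(heights):
    # Sum of absolute height differences between adjacent columns.
    return sum(abs(a - b) for a, b in zip(heights, heights[1:]))

# Illustrative jagged-but-hole-free column heights for a 10-wide board.
heights = [3, 1, 4, 2, 5, 3, 4, 2, 3, 4]
print(bumpiness(heights))  # → 17, inside the 12-19 "controlled chaos" band
```

Boards like this are rare in clean training data, which is consistent with models mishandling them.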
Yoko Li (a16z Partner, developer tools & AI infrastructure) · 2026-02-23 · Read on a16z.com
| Type | Link |
| --- | --- |
| Added | Feb 23, 2026 |
| Modified | Apr 15, 2026 |