What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench
Peter Gostev of Arena.ai presents data showing LLMs accept nonsense 50% of the time and users still dislike 9% of top-model responses despite benchmark progress.
- Arena’s ‘dislike both’ mechanic shows a 9% dissatisfaction rate for top-25 models in 2026, down from 17% before reasoning models, but still far from zero.
- BullshitBench: GPT and Gemini models accept nonsense questions ~50% of the time; Claude models (since Sonnet 3.5) push back most reliably.
- Extended reasoning (thinking mode) makes nonsense-rejection worse, not better: models briefly flag a question as invalid, then spend 20 paragraphs solving it anyway.
- Math dissatisfaction dropped dramatically; creative writing, law, finance, and gaming show little improvement over the same period.
- Gaming is a persistent weak spot: LLMs still cannot design coherent game mechanics even as general benchmarks climb.
- Arena has tracked 700+ models since Q2 2023 with 5.5M+ votes; because rankings are relative, some model is always on top, so the benchmark cannot saturate.
- User prompts have shifted to harder tasks over three years, so rising dissatisfaction in some categories reflects rising expectations, not model regression.
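The ‘dislike both’ dissatisfaction rate in the bullets above is just the share of head-to-head votes where the user rejects both responses. A minimal sketch with made-up vote data (the vote labels and function name are hypothetical illustrations, not Arena’s actual schema):

```python
from collections import Counter

# Hypothetical vote records: each head-to-head comparison ends in one of
# "model_a", "model_b", "tie", or "dislike_both" (the mechanic described above).
votes = [
    "model_a", "dislike_both", "model_b", "tie", "model_a",
    "model_b", "dislike_both", "model_a", "tie", "model_a",
]

def dissatisfaction_rate(votes):
    """Share of comparisons where the user disliked BOTH responses."""
    counts = Counter(votes)
    return counts["dislike_both"] / len(votes)

rate = dissatisfaction_rate(votes)
print(f"dissatisfaction: {rate:.0%}")  # 2 of 10 votes -> 20%
```

On this toy data the rate is 20%; Arena’s reported figures (17% falling to 9%) come from the same kind of ratio over 5.5M+ real votes.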
2026-04-24 · Watch on YouTube