What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench

· media ai · Source ↗

Summary based on the YouTube transcript and episode description.

Peter Gostev of Arena.ai presents data showing LLMs accept nonsense 50% of the time and users still dislike 9% of top-model responses despite benchmark progress.

  • Arena’s ‘dislike both’ mechanic reveals a 9% dissatisfaction rate for top-25 models in 2026, down from 17% before reasoning models but far from zero.
  • BullshitBench: GPT and Gemini models accept nonsense questions ~50% of the time; Claude models (since Sonnet 3.5) push back most reliably.
  • Extended reasoning (thinking mode) makes nonsense-rejection worse, not better — models spend 20 paragraphs solving problems they briefly flag as invalid.
  • Math dissatisfaction dropped dramatically; creative writing, law, finance, and gaming show little improvement over the same period.
  • Gaming is a persistent weak spot: LLMs still cannot design coherent game mechanics even as general benchmarks climb.
  • Arena has tracked 700+ models since Q2 2023 with 5.5M+ votes; because models are ranked relative to one another, the benchmark cannot saturate: some model is always on top.
  • User prompts have shifted to harder tasks over three years, so rising dissatisfaction in some categories reflects rising expectations, not model regression.

2026-04-24 · Watch on YouTube