What Do Models Still Suck At? - Peter Gostev, Arena.ai, BullshitBench

· media ai · Source ↗

Summary based on the YouTube transcript and episode description.

Peter Gostev of Arena.ai presents data showing LLMs accept nonsense 50% of the time and users still dislike 9% of top-model responses despite benchmark progress.

  • Arena’s ‘dislike both’ mechanic reveals a 9% dissatisfaction rate for top-25 models in 2026, down from 17% before reasoning models but far from zero.
  • BullshitBench: GPT and Gemini models accept nonsense questions ~50% of the time; Claude models (since Sonnet 3.5) push back most reliably.
  • Extended reasoning (thinking mode) makes nonsense-rejection worse, not better — models spend 20 paragraphs solving problems they briefly flag as invalid.
  • Math dissatisfaction dropped dramatically; creative writing, law, finance, and gaming show little improvement over the same period.
  • Gaming is a persistent weak spot: LLMs still cannot design coherent game mechanics even as general benchmarks climb.
  • Arena has tracked 700+ models since Q2 2023 with 5.5M+ votes; because models are ranked relative to one another, the benchmark cannot saturate: some model is always on top.
  • User prompts have shifted to harder tasks over three years, so rising dissatisfaction in some categories reflects rising expectations, not model regression.

2026-04-24 · Watch on YouTube