Robust Reasoning Benchmark: How Formatting Changes Break LLM Math Reasoning

https://arxiv.org/abs/2604.08571

Summary

Shows that LLM math reasoning is brittle: 14 formatting perturbation techniques applied to AIME 2024 problems cause significant performance drops. Frontier models are more resilient, while open-weight reasoning models are highly sensitive to superficial text changes. Useful for anyone evaluating whether an LLM truly “reasons” or merely pattern-matches on familiar formatting.
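To make the evaluation idea concrete, here is a minimal sketch of a perturbation harness: it rewrites each problem's surface form, re-runs a solver, and reports the accuracy drop per perturbation. Everything in it is an assumption made for illustration. The three perturbations, the `toy_solver` stand-in, and the sample problems are hypothetical and are not the paper's 14 techniques, its models, or AIME 2024 items.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Problem = Tuple[str, int]  # (question text, expected integer answer)

# Hypothetical surface-level perturbations (NOT the paper's techniques).
def uppercase(text: str) -> str:
    return text.upper()

def double_spaces(text: str) -> str:
    return text.replace(" ", "  ")

def trailing_newline(text: str) -> str:
    return text + "\n"

PERTURBATIONS: Dict[str, Callable[[str], str]] = {
    "uppercase": uppercase,
    "double_spaces": double_spaces,
    "trailing_newline": trailing_newline,
}

def toy_solver(prompt: str) -> Optional[int]:
    """Stand-in for a model: it only 'understands' one exact phrasing."""
    match = re.fullmatch(r"What is (\d+) \+ (\d+)\?", prompt.strip())
    if match is None:
        return None
    return int(match.group(1)) + int(match.group(2))

def accuracy(
    solver: Callable[[str], Optional[int]],
    problems: List[Problem],
    perturb: Optional[Callable[[str], str]] = None,
) -> float:
    """Fraction of problems answered correctly, optionally after perturbing."""
    correct = 0
    for question, answer in problems:
        prompt = perturb(question) if perturb else question
        if solver(prompt) == answer:
            correct += 1
    return correct / len(problems)

def robustness_report(
    solver: Callable[[str], Optional[int]], problems: List[Problem]
) -> Dict[str, float]:
    """Accuracy drop (baseline minus perturbed) for each perturbation."""
    baseline = accuracy(solver, problems)
    return {
        name: baseline - accuracy(solver, problems, fn)
        for name, fn in PERTURBATIONS.items()
    }

# Illustrative problems only; not drawn from AIME 2024.
PROBLEMS: List[Problem] = [("What is 2 + 3?", 5), ("What is 10 + 7?", 17)]

if __name__ == "__main__":
    for name, drop in robustness_report(toy_solver, PROBLEMS).items():
        print(f"{name}: accuracy drop {drop:.2f}")
```

In a real evaluation, `toy_solver` would be replaced by a call to the model under test, and the mathematical content of each problem would stay fixed while only the formatting changes, so any drop can be attributed to surface form rather than difficulty.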

Categories: cs.CL, cs.AI


Type: Link
Added: Apr 13, 2026
Modified: Apr 13, 2026