Robust Reasoning Benchmark: How Formatting Changes Break LLM Math Reasoning
https://arxiv.org/abs/2604.08571
Summary
Shows that LLMs’ math reasoning is brittle: small formatting perturbations (14 different techniques applied to AIME 2024 problems) cause significant performance drops. Frontier models are more resilient, but open-weight reasoning models are highly sensitive to superficial text changes. Useful for anyone evaluating whether an LLM truly “reasons” or just pattern-matches on familiar formatting.
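To make the evaluation idea concrete, here is a minimal sketch of a formatting-robustness check: score a model on the original problems, score it again on perturbed copies, and report the accuracy change per perturbation. Everything in it is an assumption for illustration. The three perturbations (`double_spaces`, `one_word_per_line`, `tab_indent`) are simple stand-ins, not the paper's 14 techniques; the two toy "models" are lookup functions rather than LLMs; and the two arithmetic questions stand in for AIME problems.

```python
from typing import Callable, Dict, List, Optional, Tuple

Problem = Tuple[str, str]            # (question text, expected answer)
Model = Callable[[str], str]         # prompt -> answer string
Perturbation = Callable[[str], str]  # question text -> reformatted text


def double_spaces(text: str) -> str:
    """Replace every single space with two spaces."""
    return text.replace(" ", "  ")


def one_word_per_line(text: str) -> str:
    """Put each whitespace-separated token on its own line."""
    return "\n".join(text.split())


def tab_indent(text: str) -> str:
    """Prefix the prompt with a tab character."""
    return "\t" + text


# Illustrative stand-ins only; these are NOT the paper's 14 techniques.
PERTURBATIONS: Dict[str, Perturbation] = {
    "double_spaces": double_spaces,
    "one_word_per_line": one_word_per_line,
    "tab_indent": tab_indent,
}


def accuracy(model: Model, problems: List[Problem],
             perturb: Optional[Perturbation] = None) -> float:
    """Fraction of problems answered correctly, optionally after perturbing."""
    correct = 0
    for question, expected in problems:
        prompt = perturb(question) if perturb is not None else question
        if model(prompt).strip() == expected:
            correct += 1
    return correct / len(problems)


def robustness_report(model: Model, problems: List[Problem]) -> Dict[str, float]:
    """Baseline accuracy plus the accuracy change under each perturbation."""
    baseline = accuracy(model, problems)
    report = {"baseline": baseline}
    for name, perturb in PERTURBATIONS.items():
        report[name] = accuracy(model, problems, perturb) - baseline
    return report


# Toy data and toy "models" so the sketch runs without any LLM.
PROBLEMS: List[Problem] = [("What is 2 + 3?", "5"), ("What is 7 * 6?", "42")]
ANSWERS: Dict[str, str] = dict(PROBLEMS)


def brittle_model(prompt: str) -> str:
    """Answers only when the prompt matches a known question exactly."""
    return ANSWERS.get(prompt, "")


def tolerant_model(prompt: str) -> str:
    """Collapses whitespace before lookup, so layout changes are ignored."""
    return ANSWERS.get(" ".join(prompt.split()), "")


if __name__ == "__main__":
    print(robustness_report(brittle_model, PROBLEMS))
    print(robustness_report(tolerant_model, PROBLEMS))
```

In this toy setup the exact-match model loses all of its accuracy under every perturbation, while the whitespace-normalizing model loses none. To run the same check on a real system, replace the toy functions with a call to the model under test and `PROBLEMS` with a real benchmark; answer extraction will typically need more than an exact string comparison.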
Categories: cs.CL, cs.AI
| Field | Value |
| --- | --- |
| Type | Link |
| Added | Apr 13, 2026 |
| Modified | Apr 13, 2026 |