Robust Reasoning Benchmark: How Formatting Changes Break LLM Math Reasoning

Summary

The paper shows that LLM math reasoning is brittle: small formatting perturbations (14 different techniques applied to AIME 2024 problems) cause significant performance drops. Frontier models are more resilient, while open-weight reasoning models are highly sensitive to superficial text changes. Useful for anyone evaluating whether an LLM truly “reasons” or merely pattern-matches on familiar formatting.
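
To make the evaluation idea concrete, here is a minimal Python sketch of measuring how much accuracy drops when problem text is reformatted. Everything in it is an assumption for illustration: the two perturbations (`hard_wrap`, `double_spaces`), the toy solvers, and the sample problems are hypothetical and are not the paper's 14 techniques, its models, or AIME 2024 data.

```python
import textwrap

def hard_wrap(text, width=8):
    """Hypothetical perturbation: hard-wrap the text at a narrow column."""
    return "\n".join(textwrap.wrap(text, width=width))

def double_spaces(text):
    """Hypothetical perturbation: replace every single space with two."""
    return text.replace(" ", "  ")

def accuracy(solver, problems):
    """Fraction of (question, answer) pairs the solver answers correctly."""
    correct = sum(1 for question, answer in problems if solver(question) == answer)
    return correct / len(problems)

def robustness_gap(solver, problems, perturb):
    """Accuracy on the original text minus accuracy on the perturbed text."""
    perturbed = [(perturb(question), answer) for question, answer in problems]
    return accuracy(solver, problems) - accuracy(solver, perturbed)

# Toy stand-ins for a model (assumptions, not real LLMs).
MEMORIZED = {"What is 2 + 2?": "4", "What is 3 * 5?": "15"}

def brittle_solver(question):
    # Answers only when the text matches a memorized string exactly.
    return MEMORIZED.get(question)

def robust_solver(question):
    # Collapses all whitespace runs before looking up the answer.
    return MEMORIZED.get(" ".join(question.split()))

problems = list(MEMORIZED.items())

for perturb in (hard_wrap, double_spaces):
    print(perturb.__name__,
          robustness_gap(brittle_solver, problems, perturb),
          robustness_gap(robust_solver, problems, perturb))
```

In this toy setup the exact-match solver loses all of its accuracy under either perturbation, while the whitespace-normalizing solver loses none. A real evaluation would replace the solvers with model calls and report the gap per perturbation type.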

Categories: cs.CL, cs.AI

Type	Link
Added	Apr 13, 2026
Modified	Apr 13, 2026
