SOB benchmarks 21 LLMs on actual field-value extraction accuracy across text, image, and audio, not just whether JSON parses.
Key Takeaways
Every frontier model parses valid JSON 97%+ of the time, but Value Accuracy sits 15-30 points lower; that gap is where parse-only structured output benchmarks have been misleading builders.
Qwen3.5-35B and GLM-4.7 beat GPT-5 and Claude-Sonnet-4.6 on Value Accuracy; model size does not predict extraction quality.
Audio is the hardest modality by far: best Value Accuracy is 23.7% (Gemini-2.5-Flash) versus 83.0% on text (GLM-4.7), even after normalizing to text context before scoring.
No single model dominates all three modalities; GLM-4.7 leads text, Gemma-4-31B leads images, Gemini-2.5-Flash leads audio.
Perfect Response rate collapses to roughly 50% even for top models; Path Recall and Type Safety can both read 99% while 20-30% of leaf values are still wrong.
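The metric gap above is easy to see concretely: an output can parse, hit every expected key, and match every expected type while still getting numbers wrong. The sketch below illustrates the distinction with simplified, assumed definitions (SOB's exact scoring rules may differ): Path Recall as the fraction of gold leaf paths present in the prediction, Type Safety as type agreement on those hits, and Value Accuracy as exact-match on leaf values.

```python
import json

def leaf_paths(obj, prefix=()):
    """Yield (path, value) pairs for every scalar leaf in a nested JSON object."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            yield from leaf_paths(v, prefix + (k,))
    elif isinstance(obj, list):
        for i, v in enumerate(obj):
            yield from leaf_paths(v, prefix + (i,))
    else:
        yield prefix, obj

def score(gold_json, pred_json):
    """Illustrative metrics only; not SOB's official implementation."""
    try:
        pred = json.loads(pred_json)
    except json.JSONDecodeError:
        return {"parses": False, "path_recall": 0.0,
                "type_safety": 0.0, "value_accuracy": 0.0}
    gold_leaves = dict(leaf_paths(json.loads(gold_json)))
    pred_leaves = dict(leaf_paths(pred))
    hit = [p for p in gold_leaves if p in pred_leaves]
    typed = [p for p in hit if type(pred_leaves[p]) is type(gold_leaves[p])]
    correct = [p for p in hit if pred_leaves[p] == gold_leaves[p]]
    n = len(gold_leaves)
    return {
        "parses": True,
        "path_recall": len(hit) / n,
        "type_safety": len(typed) / len(hit) if hit else 0.0,
        "value_accuracy": len(correct) / n,
    }

# Hypothetical example: valid JSON, right schema, one transposed digit.
gold = '{"invoice": {"total": 1250.0, "currency": "EUR"}}'
pred = '{"invoice": {"total": 1205.0, "currency": "EUR"}}'
print(score(gold, pred))
# parses and scores 100% on paths and types, yet only 50% of values are right
```

A benchmark that stops at `parses` or `path_recall` reports a perfect score here; only the leaf-value comparison exposes the wrong total.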