arXiv paper finds frontier LLMs corrupt ~25% of document content in long delegated workflows across 52 professional domains.
Key Takeaways
DELEGATE-52 benchmark tests 19 LLMs on sustained document editing across domains including coding, crystallography, and music notation.
Even top models (Gemini 2.5 Pro, Claude 3.7 Opus, GPT-4.5) hit ~25% corruption rates; weaker models fare worse.
Errors are sparse but severe, and they compound silently over long interactions, which is the core danger for agentic pipelines.
Agentic tool use does not improve DELEGATE-52 scores; larger documents, longer sessions, and distractor files all worsen degradation.
No current model is a reliable autonomous delegate for high-fidelity document work.
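The compounding dynamic above can be illustrated with back-of-the-envelope arithmetic (the per-pass rates and pass counts below are assumptions for illustration, not measurements from the paper): if each editing pass independently corrupts a small fraction of content, the fraction left intact decays geometrically with session length.

```python
# Illustrative only: per-pass corruption rates are assumed, not the paper's numbers.
def surviving_fidelity(per_pass_corruption: float, passes: int) -> float:
    """Fraction of content left untouched after `passes` independent edit passes."""
    return (1.0 - per_pass_corruption) ** passes

# Even a 1% per-pass corruption rate compounds noticeably over a long session.
for p in (0.01, 0.03):
    for n in (10, 50):
        print(f"rate={p:.0%}, passes={n}: {surviving_fidelity(p, n):.1%} intact")
```

This is why sparse errors are dangerous: a rate that looks negligible per interaction still erodes most of a document over a long enough session.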
Hacker News Comment Review
Commenters recognize the compounding problem from practice: context length growth correlates with error accumulation, matching the paper’s degradation-by-interaction-length finding.
The “semantic ablation” framing, in which repeated LLM passes progressively erode meaning, is seen as a real systemic risk beyond single-session use cases.
Notable Comments
@jonmoore: Flags the round-trip invertible-step evaluation method as strong; asks whether Python’s better results generalize to other languages or are benchmark artifacts.
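The paper's exact round-trip protocol isn't reproduced here, but the general idea can be sketched: apply a transformation that has a known inverse, apply the inverse, and diff the result against the original; any divergence is corruption introduced along the way. The `forward`/`inverse` pair below is a hypothetical stand-in (pure string functions) for a model-mediated edit and its undo.

```python
import difflib

def round_trip_corruption(original: str, forward, inverse) -> float:
    """Score divergence between a document and its round-tripped copy.

    Returns 0.0 when the round trip is lossless, approaching 1.0 as the
    round-tripped text diverges from the original.
    """
    round_tripped = inverse(forward(original))
    similarity = difflib.SequenceMatcher(None, original, round_tripped).ratio()
    return 1.0 - similarity

doc = "Line one\nLine two\nLine three\n"
# str.swapcase is its own inverse, so this round trip is lossless: prints 0.0.
print(round_trip_corruption(doc, str.swapcase, str.swapcase))
```

The appeal of this scheme, per the comment, is that it needs no ground-truth labels: the original document is its own reference, so corruption can be measured mechanically at scale.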