arXiv paper finds frontier LLMs (Gemini, Claude, GPT) corrupt ~25% of document content in long delegated workflows across 52 professional domains.
Key Takeaways
DELEGATE-52 benchmarks 19 LLMs on long editing workflows spanning coding, crystallography, and music notation; all models degrade over time.
Errors do not accumulate gradually: models hold near-perfect fidelity, then lose 10-30+ points in a single round trip.
Weaker models fail via content deletion; frontier models fail via content corruption, a more insidious pattern.
Agentic tool use (file read/write, code execution) does not improve DELEGATE-52 scores.
Degradation worsens with document size, interaction length, and presence of distractor files.
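To make the "not incremental" takeaway concrete, here is a minimal sketch of how one might separate a sudden cliff from gradual drift in a series of per-round fidelity scores. The function names, the 10-point threshold (borrowed from the low end of the 10-30+ range above), and the sample scores are all illustrative assumptions, not the paper's metric or data.

```python
def largest_single_round_drop(scores):
    """Return (round_index, drop) for the biggest fall between consecutive rounds.

    round_index is the 1-based position of the round where the lower score
    was observed. Returns (None, 0.0) when fewer than two scores are given.
    """
    if len(scores) < 2:
        return None, 0.0
    drops = [(i + 1, scores[i] - scores[i + 1]) for i in range(len(scores) - 1)]
    return max(drops, key=lambda pair: pair[1])


def is_cliff(scores, threshold=10.0):
    """True if any single round trip loses at least `threshold` points.

    The threshold is an assumption for illustration only.
    """
    _, drop = largest_single_round_drop(scores)
    return drop >= threshold


# Hypothetical score series, invented for this example.
cliff_like = [99.0, 98.5, 98.0, 71.0, 70.5]
gradual = [99.0, 97.0, 95.0, 93.0]

print(largest_single_round_drop(cliff_like))
print(is_cliff(cliff_like), is_cliff(gradual))
```

A check like this only flags that a collapse happened; it says nothing about whether content was deleted or corrupted, which is the distinction the paper draws between weaker and frontier models.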
Hacker News Comment Review
Commenters broadly agree long-context round-tripping is a known footgun, but flag that most non-technical LLM users have no awareness of this failure mode – making the org-wide deployment risk larger than practitioners assume.
The JPEG-compression analogy recurs: each LLM pass erodes precision and nuance, compounding silently; terms like “semantic ablation” and “meanwit reversion” capture the effect.
Practical mitigations discussed: treat document writing as a final rendering pass only, store knowledge as composable markdown fact-files with front-matter, and minimize LLM round trips by using the model as a thin translation layer into deterministic processes.
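The fact-file idea can be sketched roughly as follows. This is one possible reading of the suggestion, not any commenter's actual setup: the `id` and `title` front-matter keys, the `parse_fact_file` and `render` helpers, and the simplified `key: value` parsing (not full YAML) are all assumptions made for illustration. The point is that assembly is deterministic, so fact bodies are copied verbatim instead of being rewritten on each pass.

```python
def parse_fact_file(text):
    """Split a markdown fact file into (metadata dict, body string).

    Expects front-matter delimited by '---' lines containing simple
    'key: value' pairs. Files without valid front-matter yield an
    empty metadata dict and the whole text as body.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text.strip()
    meta = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            body = "\n".join(lines[i + 1:]).strip()
            return meta, body
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    # No closing delimiter found: treat the whole text as body.
    return {}, text.strip()


def render(fact_texts):
    """Assemble fact files into one document, ordered by their 'id' key.

    Bodies are concatenated unchanged; no model is involved in this step.
    """
    facts = [parse_fact_file(t) for t in fact_texts]
    facts.sort(key=lambda fact: fact[0].get("id", ""))
    sections = [
        f"## {meta.get('title', 'Untitled')}\n\n{body}" for meta, body in facts
    ]
    return "\n\n".join(sections)


# Hypothetical fact files, invented for this example.
fact_a = "---\nid: 002\ntitle: Tool use\n---\nAgentic tools did not help."
fact_b = "---\nid: 001\ntitle: Degradation\n---\nAll models degrade."

print(render([fact_a, fact_b]))
```

In a setup like this, an LLM would touch the content at most once, either to write an individual fact file or to polish the final rendering, so an error in one pass cannot compound across many round trips.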
Notable Comments
@buffaloPizzaBoy: store agent findings as independent markdown files with front-matter; use LLM only for final document rendering to avoid compounding corruption.
@simonw: skeptical of tool-use results given the “basic agentic harness” implementation – a more robust setup might show different outcomes.