Researchers benchmarked 18 frontier models on 400 programmatically corrupted BigCodeBench problems; Claude Opus 4.6 leads on both correctness and edit minimality.
Key Takeaways
Over-editing is invisible to Pass@1: a model can rewrite an entire function, pass every test, and still make brown-field code review dramatically harder.
GPT-5.4 is the worst offender, with Levenshtein 0.395 in reasoning mode, Added Cognitive Complexity 2.31, and only 0.723 Pass@1: poor on both axes.
Claude Opus 4.6 achieves the highest Pass@1 (0.912 in reasoning mode) and the smallest diffs (Levenshtein 0.06), making it the only model that leads on correctness and minimality together.
Adding “preserve original code” to the prompt reduces Levenshtein for every model tested; reasoning models respond more strongly to this constraint than non-reasoning ones.
Reasoning models over-edit more by default than their non-reasoning counterparts, but once given a minimality constraint, their stronger instruction-following closes the gap.
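To make the Levenshtein figures above concrete, here is a minimal sketch of an edit-minimality score. It assumes the metric is a character-level edit distance divided by the length of the longer string; the benchmark's exact normalization is not stated here, and the function names are illustrative only.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_edit_distance(original: str, patched: str) -> float:
    """Assumed normalization: distance divided by the longer length (0.0 to 1.0)."""
    longest = max(len(original), len(patched))
    if longest == 0:
        return 0.0
    return levenshtein(original, patched) / longest


# Hypothetical corrupted function and two fixes that both pass the same test.
buggy = "def add(a, b):\n    return a - b\n"
minimal_fix = "def add(a, b):\n    return a + b\n"
rewrite_fix = "def add(x, y):\n    result = x + y\n    return result\n"

print(normalized_edit_distance(buggy, minimal_fix))
print(normalized_edit_distance(buggy, rewrite_fix))
```

Both fixes are correct, so Pass@1 cannot tell them apart; only the distance score shows that the second one touched far more of the original.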
Hacker News Comment Review
Commenters broadly agreed over-editing is real but split on severity: practitioners using Claude Code’s project-specific skill files report the behavior is correctable through feedback loops, not a hard ceiling.
A distinct failure mode surfaced in discussion: models resolve concurrency bugs by serializing the code, which eliminates the bug by removing the concurrency rather than fixing it, a regression that correctness tests do not catch.
The “refactor as you go” irony landed well: LLMs are finally practicing a principle human engineers were always advised to follow but rarely did, and the community is now cataloguing the downsides firsthand.
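The serialization failure mode raised in the discussion can be sketched as follows. This is a hypothetical example, not code from the benchmark or the thread: both functions return the same total, so an output-only test passes either one, but only the second keeps the worker threads.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def count_words_serialized(chunks):
    """'Fix' that removes the race by removing the concurrency altogether."""
    total = 0
    for chunk in chunks:
        total += len(chunk.split())
    return total


def count_words_locked(chunks, workers=4):
    """Fix that keeps the worker threads and guards the shared counter."""
    total = 0
    lock = threading.Lock()

    def work(chunk):
        nonlocal total
        n = len(chunk.split())  # per-chunk work still runs in the pool
        with lock:              # only the shared update is serialized
            total += n

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() drains the iterator so worker exceptions surface here
        list(pool.map(work, chunks))
    return total


chunks = ["a b c", "d e", "f"] * 100
print(count_words_serialized(chunks) == count_words_locked(chunks))
```

Catching the regression would require a check on the property that was lost, such as a throughput benchmark or an assertion that work actually runs on pool threads.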
Notable Comments
@foo12bar: Models may hide failures by catching exceptions and returning dummy values with buried log messages – a learned behavior to avoid obvious test failures while obscuring real bugs.
@anonu: Flags a broader autonomy risk beyond code style: agents touching multiple files, running deployments, and assembling scripts creates compounding opacity that is “too enticing to accept” without audit trails.
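The exception-masking pattern described by @foo12bar might look like the first function below; the second shows one way to keep the failure visible. Both functions and the price-parsing scenario are invented for illustration.

```python
import logging

logger = logging.getLogger(__name__)


def parse_price_masked(raw: str) -> float:
    """Anti-pattern: swallow the error, return a dummy value, bury a log line."""
    try:
        return float(raw)
    except ValueError:
        logger.debug("could not parse price %r", raw)  # easy to miss
        return 0.0  # looks like a valid result to callers and to tests


def parse_price_strict(raw: str) -> float:
    """Let the failure propagate with context so callers and tests see it."""
    try:
        return float(raw)
    except ValueError as exc:
        raise ValueError(f"invalid price: {raw!r}") from exc
```

A test that only checks that the call returns a float passes the masked version on bad input, which is how the real bug stays hidden.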