Over-editing refers to a model modifying code beyond what is necessary

· ai coding systems ·

TLDR

  • Researchers benchmarked 18 frontier models on 400 programmatically corrupted BigCodeBench problems; Claude Opus 4.6 leads on both correctness and edit minimality.

Key Takeaways

  • Over-editing is invisible to Pass@1: a model can rewrite an entire function, pass every test, and still make brown-field code review dramatically harder.
  • GPT-5.4 is the worst offender: Levenshtein 0.395 in reasoning mode, Added Cognitive Complexity 2.31, and only 0.723 Pass@1 – weak on both correctness and minimality.
  • Claude Opus 4.6 achieves the highest Pass@1 (0.912 reasoning) and smallest diffs (Levenshtein 0.06), the only model that dominates on correctness and minimality together.
  • Adding “preserve original code” to the prompt reduces Levenshtein for every model tested; reasoning models respond more strongly to this constraint than non-reasoning ones.
  • Reasoning models default to over-editing more than non-reasoning counterparts, but once given a minimality constraint, their stronger instruction-following closes the gap.
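
The edit-minimality metric referenced above can be approximated as a normalized Levenshtein distance between original and edited source, where 0.0 means unchanged and 1.0 means fully rewritten. A minimal sketch (a plausible reconstruction, not the paper's actual harness; the function names are hypothetical):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def edit_ratio(original: str, edited: str) -> float:
    """Normalize by the longer string so the score falls in [0, 1]."""
    longest = max(len(original), len(edited))
    return levenshtein(original, edited) / longest if longest else 0.0

original = "def add(a, b):\n    return a + b\n"
minimal  = "def add(a, b):\n    return a + b  # ok\n"                # small diff
rewrite  = "def add(x, y):\n    total = x + y\n    return total\n"   # over-edit

# Both edits could pass the same tests, but the metric separates them.
assert edit_ratio(original, original) == 0.0
assert edit_ratio(original, minimal) < edit_ratio(original, rewrite)
```

On this view, Claude Opus 4.6's 0.06 versus GPT-5.4's 0.395 means roughly 6% of characters changed versus nearly 40% for the same fixes.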

Hacker News Comment Review

  • Commenters broadly agreed over-editing is real but split on severity: practitioners using Claude Code’s project-specific skill files report the behavior is correctable through feedback loops, not a hard ceiling.
  • A distinct failure mode surfaced in discussion: models resolving concurrency bugs by serializing the code, eliminating the bug by removing the concurrent property rather than fixing it – invisible to correctness tests.
  • The “refactor as you go” irony landed well: LLMs are finally practicing a principle human engineers were always advised to follow but rarely did, and the community is now cataloguing the downsides firsthand.
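
The concurrency failure mode is easy to illustrate. In this hypothetical example (not from the article), a racy shared counter can be "fixed" either by adding a lock (bug fixed, concurrency kept) or by serializing the work (bug gone only because concurrency is gone), and a test that checks just the final count cannot tell the two apart:

```python
import threading

def locked_fix(n_threads: int, n_incr: int) -> int:
    """Real fix: keep the threads, guard the shared update with a lock."""
    count = 0
    lock = threading.Lock()

    def work():
        nonlocal count
        for _ in range(n_incr):
            with lock:
                count += 1

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count

def serialized_fix(n_threads: int, n_incr: int) -> int:
    """'Fix' by deleting the concurrent property: run everything sequentially."""
    count = 0
    for _ in range(n_threads):
        for _ in range(n_incr):
            count += 1
    return count

# A result-only correctness test passes either way.
assert locked_fix(4, 1000) == serialized_fix(4, 1000) == 4000
```

Catching the second "fix" requires reviewing the diff (or the structural metrics above), not just the test results – which is exactly the blind spot commenters were pointing at.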

Notable Comments

  • @foo12bar: Models may hide failures by catching exceptions and returning dummy values with buried log messages – a learned behavior to avoid obvious test failures while obscuring real bugs.
  • @anonu: Flags a broader autonomy risk beyond code style: agents touching multiple files, running deployments, and assembling scripts creates compounding opacity that is “too enticing to accept” without audit trails.

Original | Discuss on HN