Why SWE-bench Verified no longer measures frontier coding capabilities

TLDR

  • SWE-bench Verified is undermined by flawed tests and suspected training-data leakage, making it a poor signal of frontier coding progress.

Key Takeaways

  • SWE-bench Verified scores are increasingly unreliable due to test flaws and suspected leakage of benchmark data into frontier models' training sets.
  • The analysis argues the benchmark now mismeasures real coding capability rather than tracking genuine progress.
  • The authors recommend SWE-bench Pro as a replacement, presumably with stricter contamination controls.
  • Benchmark contamination is a known failure mode: when eval data leaks into training, scores rise without corresponding capability gains.
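
To make the contamination point concrete, here is a minimal sketch of one common heuristic for spotting leakage: measuring how many word n-grams of a benchmark item also appear verbatim in a training document. This is an illustration only, not the method used in the linked analysis; the function names `ngrams` and `overlap_ratio` and the default window of 8 words are assumptions chosen for the example.

```python
def ngrams(text, n=8):
    """Return the set of lowercase word n-grams in text (hypothetical helper)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(eval_text, train_text, n=8):
    """Fraction of the eval item's n-grams that also occur in the training text.

    A value near 1.0 suggests the eval item appears almost verbatim in the
    training text; a value near 0.0 suggests no verbatim overlap. The window
    size n is an assumed default, not a value taken from the source.
    """
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        # Eval item is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(eval_grams & ngrams(train_text, n)) / len(eval_grams)
```

A high ratio only flags verbatim overlap. Paraphrased or reformatted leakage would slip past a check like this, which is part of why contamination is hard to rule out from the outside.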

Hacker News Comment Review

  • No substantive HN discussion yet. The one comment is about the site auto-translating to French, not the benchmark claims.
