TLDR
- SWE-bench Verified is undermined by flawed tests and suspected training-data leakage, making it a poor signal for frontier coding progress.
Key Takeaways
- SWE-bench Verified is increasingly unreliable because of flawed tests and suspected leakage of benchmark tasks into the training data of frontier models.
- The analysis argues that the benchmark now mismeasures real coding capability instead of tracking genuine progress.
- The authors recommend SWE-bench Pro as a replacement, presumably with stricter contamination controls.
- Benchmark contamination is a known failure mode: when eval data leaks into training, scores rise without corresponding capability gains.
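To make the contamination point concrete, here is a minimal sketch of one common heuristic: measuring how many word n-grams from an eval item also appear verbatim in a training document. This is not the method used in the analysis summarized here. The function names `ngrams` and `overlap_ratio` and the default window of 8 tokens are assumptions chosen for illustration.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(eval_text: str, training_text: str, n: int = 8) -> float:
    """Fraction of the eval item's n-grams that also occur in the training text.

    Hypothetical helper: a high ratio hints at verbatim leakage,
    but a low ratio does not rule out paraphrased leakage.
    """
    eval_grams = ngrams(eval_text, n)
    if not eval_grams:
        # Eval item is shorter than n tokens, so there is nothing to compare.
        return 0.0
    train_grams = ngrams(training_text, n)
    return len(eval_grams & train_grams) / len(eval_grams)
```

A ratio near 1.0 for an eval task suggests the task text was likely seen during training, which is the situation in which scores can rise without a matching gain in capability. Exact-match overlap is only a rough signal, since it misses reworded or partially quoted tasks.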
Hacker News Comment Review
- No substantive HN discussion yet. The one comment is about the site auto-translating to French, not the benchmark claims.