N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

https://ndaybench.winfunc.com

Article

  • Benchmark tests LLMs on finding real N-day vulnerabilities in actual codebases
  • Three-agent harness: Curator builds answer key, Finder explores code, Judge scores blindly
  • Finder gets 24 shell steps starting from sink hints, never sees the patch
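The three-agent split above can be sketched as a minimal loop. Everything here is assumption: the class and function names, the answer-key shape, and the scripted stand-in for the LLM are all hypothetical, since the real harness is not released; only the Curator/Finder/Judge roles and the 24-step budget come from the article.

```python
from dataclasses import dataclass

MAX_SHELL_STEPS = 24  # Finder's shell budget, per the article


@dataclass
class Case:
    sink_hints: list  # starting hints given to the Finder
    patch: str        # ground-truth fix; hidden from the Finder


def curator(case):
    """Curator sees the patch and builds the answer key (hypothetical shape)."""
    return {"root_cause": case.patch, "hints": case.sink_hints}


def finder(case, run_step):
    """Finder explores the codebase under a bounded shell budget.

    `run_step` stands in for one LLM-driven shell command; the Finder
    receives only the sink hints and prior findings, never the patch."""
    findings = []
    for _ in range(MAX_SHELL_STEPS):
        out = run_step(case.sink_hints, findings)
        if out is None:  # model decides it is done early
            break
        findings.append(out)
    return findings


def judge(answer_key, findings):
    """Judge scores the report blindly against the answer key
    (toy scoring: fraction of findings matching the root cause)."""
    hits = sum(1 for f in findings if f in answer_key["root_cause"])
    return hits / max(len(findings), 1)


# Toy run with a scripted "model" instead of a real LLM.
case = Case(sink_hints=["memcpy in parser.c"],
            patch="memcpy overflow in parser.c")
key = curator(case)
script = iter(["memcpy", "overflow", None])
score = judge(key, finder(case, lambda hints, so_far: next(script)))
print(score)  # both scripted findings appear in the answer key -> 1.0
```

The point of the structure, as the article frames it, is information asymmetry: the Curator has full knowledge, the Finder is patch-blind, and the Judge scores without seeing the Finder's process.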

Discussion

  • Possible harness bug flagged: GPT appears to score implausibly on one case
  • Critics call the scoring rubric ‘vibe-coded’: the Judge can alter rubric weights, undermining reproducibility
  • Community asks for OSS harness release to enable independent verification
  • Requests to add open-source models (Gemma, Qwen) and false-positive test cases

Discuss on HN


Type Link
Added Apr 14, 2026
Modified Apr 14, 2026