N-Day-Bench – Can LLMs find real vulnerabilities in real codebases?

https://ndaybench.winfunc.com

Article

Benchmark tests LLMs on finding real N-day vulnerabilities in actual codebases
Three-agent harness: Curator builds answer key, Finder explores code, Judge scores blindly
Finder gets 24 shell steps starting from sink hints, never sees the patch
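
The Finder's step budget and the blind hand-off to the Judge can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: the function names, the ("shell" | "report") action format, and the `policy`, `run_shell`, and `judge` callables are all hypothetical. Only the 24-step limit, the sink-hint starting point, and the rule that the Finder never sees the answer key come from the summary above.

```python
from typing import Callable, Tuple

MAX_SHELL_STEPS = 24  # step budget stated in the summary

# Hypothetical action format: ("shell", command) or ("report", findings).
Action = Tuple[str, str]


def run_finder(
    sink_hint: str,
    policy: Callable[[str], Action],
    run_shell: Callable[[str], str],
    max_steps: int = MAX_SHELL_STEPS,
) -> Tuple[str, int]:
    """Let the Finder explore until it reports or runs out of shell steps."""
    observation = sink_hint  # the Finder starts from the sink hint only
    steps_used = 0
    while steps_used < max_steps:
        kind, text = policy(observation)
        if kind == "report":
            return text, steps_used
        observation = run_shell(text)  # one shell step consumed
        steps_used += 1
    return "", steps_used  # budget exhausted without a report


def evaluate_case(
    sink_hint: str,
    answer_key: str,
    policy: Callable[[str], Action],
    run_shell: Callable[[str], str],
    judge: Callable[[str, str], float],
) -> Tuple[float, int]:
    """Run one case; the answer key goes to the judge, never to the Finder."""
    report, steps_used = run_finder(sink_hint, policy, run_shell)
    # The judge receives only the report and the key, not the model identity.
    return judge(report, answer_key), steps_used
```

Passing the answer key only to `judge` keeps the separation of roles visible in the function signatures: nothing in `run_finder` can leak the patch to the model under test.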

Discussion

Possible harness bug flagged: GPT appears to score implausibly on one case
Critics call the rubric ‘vibe-coded’: the judge can alter weights, which undermines reproducibility
Community asks for OSS harness release to enable independent verification
Requests to add open-source models (Gemma, Qwen) and false-positive test cases
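
One way to address the reproducibility concern raised in the discussion is to freeze the rubric weights outside the judge, so the judge only marks criteria as met or unmet. The sketch below is an assumption-laden illustration: the criterion names and weights are invented for the example and are not taken from N-Day-Bench.

```python
from types import MappingProxyType
from typing import Mapping

# Hypothetical criteria and integer weights (summing to 100); not the real rubric.
RUBRIC_WEIGHTS: Mapping[str, int] = MappingProxyType(
    {"location": 40, "root_cause": 40, "impact": 20}
)


def score_report(marks: Mapping[str, bool]) -> int:
    """Combine the judge's per-criterion marks using the frozen weights."""
    if set(marks) != set(RUBRIC_WEIGHTS):
        raise ValueError("marks must cover exactly the rubric criteria")
    return sum(weight for name, weight in RUBRIC_WEIGHTS.items() if marks[name])
```

Because `RUBRIC_WEIGHTS` is a read-only mapping, any attempt to reassign a weight at run time raises `TypeError`, and two runs with the same marks always produce the same score.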

Type	Link
Added	Apr 14, 2026
Modified	Apr 14, 2026
