https://ndaybench.winfunc.com
Article
- Benchmark tests LLMs on finding real N-day vulnerabilities in actual codebases
- Three-agent harness: Curator builds answer key, Finder explores code, Judge scores blindly
- Finder gets 24 shell steps starting from sink hints, never sees the patch
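
The Finder/Judge flow summarized above might look roughly like the following. This is a minimal sketch, not the benchmark's actual harness (which commenters note is unreleased): the names `Case`, `Transcript`, `run_finder`, `judge`, and the callback parameters are all hypothetical, and "blind" is assumed here to mean the Judge sees only the report and the answer key. Only the 24-step budget, the sink hints, and the hidden patch come from the summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

MAX_STEPS = 24  # shell-step budget stated in the summary


@dataclass
class Case:
    sink_hints: List[str]  # starting points the Finder is allowed to see
    answer_key: str        # built by the Curator; never passed to the Finder


@dataclass
class Transcript:
    commands: List[str] = field(default_factory=list)
    outputs: List[str] = field(default_factory=list)
    report: str = ""


def run_finder(
    case: Case,
    next_command: Callable[[List[str], Transcript], Optional[str]],
    run_shell: Callable[[str], str],
    write_report: Callable[[Transcript], str],
) -> Transcript:
    """Let the Finder explore for at most MAX_STEPS shell commands."""
    transcript = Transcript()
    for _ in range(MAX_STEPS):
        # Only the sink hints and prior steps are exposed, not the answer key.
        cmd = next_command(case.sink_hints, transcript)
        if cmd is None:  # the Finder may stop before using its whole budget
            break
        transcript.commands.append(cmd)
        transcript.outputs.append(run_shell(cmd))
    transcript.report = write_report(transcript)
    return transcript


def judge(report: str, answer_key: str, score_fn: Callable[[str, str], float]) -> float:
    """Score a report against the answer key (assumed: no model identity is passed in)."""
    return score_fn(report, answer_key)
```

In a real run, `next_command` and `write_report` would be backed by the model under test and `run_shell` by a sandboxed shell; here they are plain callbacks so the step budget and the information boundary are easy to check.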
Discussion
- Possible harness bug flagged: GPT appears to score implausibly on one case
- Critics call the rubric ‘vibe-coded’: the judge can alter weights, which undermines reproducibility
- Community asks for an open-source release of the harness to enable independent verification
- Requests to add open-source models (Gemma, Qwen) and false-positive test cases
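
To make the reproducibility concern concrete, here is a minimal sketch of the kind of setup commenters seem to be asking for: rubric weights that are fixed ahead of time rather than adjustable by the judge, plus a scoring rule for a false-positive case. The criteria names, the weights, and both function names are illustrative assumptions, not taken from the benchmark.

```python
from types import MappingProxyType
from typing import Mapping

# Hypothetical criteria and weights, frozen so that scoring code cannot change them.
RUBRIC_WEIGHTS: Mapping[str, float] = MappingProxyType(
    {"location": 0.4, "root_cause": 0.4, "impact": 0.2}
)


def score(marks: Mapping[str, float]) -> float:
    """Weighted sum of per-criterion marks in [0, 1], using the fixed weights."""
    unknown = set(marks) - set(RUBRIC_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown rubric criteria: {sorted(unknown)}")
    # Criteria the judge did not mark count as 0.
    return sum(w * float(marks.get(name, 0.0)) for name, w in RUBRIC_WEIGHTS.items())


def score_negative_case(reported_vulnerability: bool) -> float:
    """False-positive case: the code has no planted bug, so any finding scores 0."""
    return 0.0 if reported_vulnerability else 1.0
```

With the weights frozen, two runs that assign the same marks always produce the same score, which is the property the critics say the current rubric lacks.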