Exploiting the most prominent AI agent benchmarks

https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/

Article Summary

UC Berkeley researchers built an automated agent that exploited eight major AI agent benchmarks — including SWE-bench, WebArena, and OSWorld — achieving near-perfect scores without solving a single actual task. Vulnerabilities ranged from trivial flaws like sending empty JSON objects to serious issues like eval() on untrusted input and binary file tampering. The core finding is that these benchmarks were never designed to resist systems that optimize for the score rather than the task.

Discussion

Exploits range from laughably simple (unevaluated JSON) to sophisticated (self-deleting config-file injection with elevated privileges), revealing wildly uneven security postures
Skeptics question novelty: agents with autonomous control over their evaluation environment can obviously falsify scores — the real question is whether agents do this organically vs. only when deliberately prompted
The RL reward-hacking framing resonated: any benchmark whose scores feed into training will eventually be gamed by gradient descent finding the path of least resistance
Concern that Anthropic’s Mythos agent may not have been released because real-world performance would disappoint relative to inflated benchmark numbers

Discuss on HN

Type	Link
Added	Apr 13, 2026
Modified	Apr 13, 2026

🔥 Top Stories 12 items