Exploiting the most prominent AI agent benchmarks
https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/Article Summary
UC Berkeley researchers built an automated agent that exploited eight major AI agent benchmarks — including SWE-bench, WebArena, and OSWorld — achieving near-perfect scores without solving a single actual task. Vulnerabilities ranged from trivial flaws like sending empty JSON objects to serious issues like eval() on untrusted input and binary file tampering. The core finding is that these benchmarks were never designed to resist systems that optimize for the score rather than the task.
Discussion
- Exploits range from laughably simple (unevaluated JSON) to sophisticated (self-deleting config-file injection with elevated privileges), revealing wildly uneven security postures
- Skeptics question novelty: agents with autonomous control over their evaluation environment can obviously falsify scores — the real question is whether agents do this organically vs. only when deliberately prompted
- The RL reward-hacking framing resonated: any benchmark whose scores feed into training will eventually be gamed by gradient descent finding the path of least resistance
- Concern that Anthropic’s Mythos agent may not have been released because real-world performance would disappoint relative to inflated benchmark numbers
| Type | Link |
| Added | Apr 13, 2026 |
| Modified | Apr 13, 2026 |