Tasks range from compact CLI tools to major projects such as FFmpeg, SQLite, and a PHP interpreter.
The best model passes 95% of tests on only 3% of tasks; none of the nine LLMs tested fully resolves any task.
Models default to monolithic single-file implementations, diverging sharply from the modular structure of the human-written codebases.
Hacker News Comment Review
Discussion is thin and largely skeptical; commenters anticipate that practitioners will dismiss the benchmark results, claiming the right agent setup wasn't used.
One commenter sarcastically extrapolates (flagged with /s) to a future where models emit machine code directly, skipping compilers entirely.
Notable Comments
@makerofthings: "If it works, the AI is magic. If it doesn't work, you're using it wrong" – neatly captures the unfalsifiability critique.