ProgramBench: Can Language Models Rebuild Programs from Scratch?


TLDR

  • An arXiv paper introduces ProgramBench, finding that no LLM can fully reconstruct real programs such as FFmpeg or SQLite from documentation alone.

Key Takeaways

  • ProgramBench tasks agents with re-implementing 200 programs using only the reference executable and its documentation, without prescribing any code structure.
  • Evaluation uses agent-driven fuzzing to generate behavioral tests, avoiding bias toward any particular implementation structure (see the sketch after this list).
  • Tasks range from compact CLI tools to major projects such as FFmpeg, SQLite, and the PHP interpreter.
  • The best model passes 95% of tests on only 3% of tasks; across the 9 LLMs tested, no model fully solves any task.
  • Models default to monolithic single-file implementations, diverging sharply from human-written modular codebases.
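
As a rough illustration of the evaluation idea, here is a minimal differential-fuzzing sketch in Python: random invocations are thrown at both the reference executable and the candidate re-implementation, and a trial passes when their observable behavior matches. The paths REFERENCE_BIN and CANDIDATE_BIN and the input generator are hypothetical placeholders; the paper's actual agent-driven harness is not reproduced here.

```python
import random
import string
import subprocess

# Hypothetical paths; ProgramBench's real harness and agent loop are not shown here.
REFERENCE_BIN = "./reference/tool"   # ground-truth executable shipped with the task
CANDIDATE_BIN = "./candidate/tool"   # the agent's re-implementation under test

def random_invocation(rng: random.Random) -> tuple[list[str], bytes]:
    """Draw random CLI arguments and stdin bytes; a real fuzzer would be doc- or grammar-guided."""
    args = ["".join(rng.choices(string.ascii_letters + string.digits + "-.", k=rng.randint(1, 8)))
            for _ in range(rng.randint(0, 3))]
    stdin_data = bytes(rng.randrange(256) for _ in range(rng.randint(0, 64)))
    return args, stdin_data

def behavior(binary: str, args: list[str], stdin_data: bytes) -> tuple[int, bytes, bytes]:
    """Capture the observable behavior of one run: exit code, stdout, stderr."""
    try:
        proc = subprocess.run([binary, *args], input=stdin_data,
                              capture_output=True, timeout=5)
    except subprocess.TimeoutExpired:
        return -1, b"", b"<timeout>"   # treat a hang as its own observable outcome
    return proc.returncode, proc.stdout, proc.stderr

def agreement(trials: int = 1000, seed: int = 0) -> float:
    """Fraction of random invocations on which candidate and reference behave identically."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        args, stdin_data = random_invocation(rng)
        if behavior(CANDIDATE_BIN, args, stdin_data) == behavior(REFERENCE_BIN, args, stdin_data):
            passed += 1
    return passed / trials

if __name__ == "__main__":
    print(f"behavioral agreement: {agreement():.1%}")
```

Comparing only the (exit code, stdout, stderr) tuple treats both binaries as black boxes, which is what lets this style of evaluation avoid rewarding any particular internal code structure.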

Hacker News Comment Review

  • Discussion is thin and largely skeptical; commenters anticipate that the results will be dismissed by practitioners claiming the right agent setup wasn’t used.
  • One commenter sarcastically extrapolates to a future where models emit machine code directly, skipping compilers entirely (marked with /s).

Notable Comments

  • @makerofthings: “If it works, the AI is magic. If it doesn’t work, you’re using it wrong” – captures the unfalsifiability critique neatly.

Original | Discuss on HN