Lambda Calculus Benchmark for AI


TLDR

  • LamBench scores AI models on 120 pure lambda calculus problems using Lamb, a minimal language where all data structures are λ-encoded.
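The "λ-encoded" data structures mentioned above can be illustrated with Church numerals, the classic lambda-calculus encoding of natural numbers. The sketch below uses Python lambdas for readability; it is not Lamb syntax, and the helper names (`zero`, `succ`, `add`, `to_int`) are illustrative, not from the benchmark.

```python
# Church numerals: the number n is encoded as a function that
# applies f to x exactly n times. All "data" is just functions.
zero = lambda f: lambda x: x
succ = lambda n: lambda f: lambda x: f(n(f)(x))
add  = lambda m: lambda n: lambda f: lambda x: m(f)(n(f)(x))

def to_int(n):
    """Decode a Church numeral back to a Python int."""
    return n(lambda k: k + 1)(0)

two   = succ(succ(zero))
three = succ(two)
print(to_int(add(two)(three)))  # 5
```

Solving benchmark problems in this style means a model must manipulate functions like these directly, with no built-in numbers, lists, or booleans to fall back on.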

Key Takeaways

  • gpt-5.4 leads at 91.7% (110/120); opus-4.6 follows at 90.0% and gpt-5.3-codex at 89.2%.
  • opus-4.7 and gemini-3.1-pro-preview tie at 88.3%; sonnet-4.6 scores 82.5%; a clear tier break separates these from the rest.
  • gpt-5 (the unversioned flagship) scores only 66.7%, ranking 16th out of 21 models tested.
  • gpt-5.3-codex-spark collapses to 11.7% while its sibling gpt-5.3-codex ranks 3rd at 89.2%, a 77.5-point gap within the same family.
  • gemma-4-31b-it scores 18.3% and deepseek-v4-pro 53.3%, suggesting open and non-frontier models struggle specifically with pure lambda reasoning.

Hacker News Comment Review

  • The single-attempt one-shot methodology is contested: benchmarking non-deterministic probabilistic models reliably requires roughly 45 runs per problem to capture variance, not one.
  • The GitHub repo (VictorTaelin/LamBench) confirms the scope: 120 problems, Lamb language, λ-encodings of data structures, with live results published.
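The repeated-sampling objection can be made concrete with a toy simulation. The numbers below (a hypothetical true pass rate of 0.85) are illustrative, not taken from LamBench; 45 echoes the run count cited in the discussion.

```python
import random

random.seed(0)

def run_benchmark(true_pass_rate, n_problems=120):
    """Simulate one single-attempt run: each problem passes
    independently with the model's (hypothetical) true pass rate."""
    passed = sum(random.random() < true_pass_rate for _ in range(n_problems))
    return passed / n_problems

# Ten independent single-shot scores for the same model scatter noticeably...
single_runs = [run_benchmark(0.85) for _ in range(10)]
spread = max(single_runs) - min(single_runs)

# ...while averaging 45 runs converges toward the true rate.
mean_of_45 = sum(run_benchmark(0.85) for _ in range(45)) / 45
```

On a 120-problem benchmark, a single run of a model with an 85% true pass rate has a standard deviation of about 3 percentage points, which is larger than several of the leaderboard gaps above.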

Notable Comments

  • @dataviz1000: “The models are reliably incorrect.” Argues that single-shot scores mislead because LLM outputs are non-deterministic, so valid rankings require repeated sampling.
