The last six months in LLMs in five minutes

TLDR

  • Annotated PyCon US 2026 lightning talk covering the November 2025 inflection point when coding agents crossed from “often-work” to “mostly-work” reliability.

Key Takeaways

  • November 2025 saw the “best” model crown change hands five times across Anthropic, OpenAI, and Google; Claude Opus 4.5 held it for months after.
  • Coding agents (Codex, Claude Code) became daily-driver tools after sustained Reinforcement Learning from Verifiable Rewards (RLVR) training through most of 2025.
  • OpenClaw, which started as an obscure repo in November, became a phenomenon by February; “Claws” is now a generic term for personal AI assistants.
  • Laptop-class open-weight models made the biggest surprise gains: GLM-5.1 (1.5TB, open weight) and Google’s Gemma 4 series outperformed expectations.
  • The pelican-on-bicycle SVG benchmark, used throughout the talk, tracks frontier models’ visual reasoning on a task that labs have had no particular incentive to train for.

Hacker News Comment Review

  • Practitioners split sharply on the “inflection point” claim: some report 10x speed gains using Claude Code as an exoskeleton for JS/TS/Elixir/Ruby; others find agents still struggle to produce fully fledged production apps without heavy steering.
  • A skeptical thread argues the past six months also brought memory-market cornering that suppressed local AI adoption, IP-exfiltrating agent tools spreading inside companies, and developers shipping more code than they can read.
  • The pelican benchmark’s credibility is contested: it traces to the SVG tests in Microsoft’s 2023 GPT-4 “Sparks of AGI” paper, and commenters note that labs could now easily train for this specific prompt given its public profile.

Notable Comments

  • @simonw: author says he did not submit this post and did not expect it to fit HN.
