Arena AI Model ELO History


TLDR

  • Interactive chart tracking each major AI lab’s flagship model ELO over time using daily-fetched LMSYS Arena leaderboard data, exposing post-launch degradation and nerfs.

Key Takeaways

  • Data pulls daily from the official LM Arena Leaderboard Dataset on Hugging Face, sourced from thousands of blind crowdsourced human preference votes.
  • Each lab gets one curve tracking its highest-ELO flagship at any moment, not just the latest announced model, so Opus stays on top if it still leads Sonnet.
  • Inference-mode suffixes like -thinking, -reasoning, and -high are collapsed into one model to prevent curve flip-flopping.
  • Chart distinguishes two risk patterns: hard drops at release (new model underperforms predecessor) and slow inter-release decay visible as downward slope.
  • API benchmarks may not capture silent quantization or system-prompt wrappers added in consumer UIs like chatgpt.com or gemini.google.com.
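The suffix-collapsing and highest-ELO-flagship rules above can be sketched as follows. This is a minimal illustration, not the post's actual code; the function names, the row shape `(date, lab, model, elo)`, and the exact suffix list are assumptions based on the description.

```python
from collections import defaultdict

# Inference-mode suffixes named in the post; the chart treats e.g.
# "model-x-thinking" and "model-x" as the same model.
SUFFIXES = ("-thinking", "-reasoning", "-high")

def normalize(model: str) -> str:
    """Strip inference-mode suffixes so a lab's curve doesn't flip-flop
    between variants of the same underlying model."""
    for s in SUFFIXES:
        if model.endswith(s):
            return model[: -len(s)]
    return model

def flagship_curves(rows):
    """rows: iterable of (date, lab, model, elo) leaderboard snapshots.
    Returns {lab: [(date, elo), ...]} where each point is the lab's
    highest-ELO model on that date -- not merely its latest release."""
    best = defaultdict(dict)  # lab -> date -> best ELO seen
    for date, lab, model, elo in rows:
        model = normalize(model)
        prev = best[lab].get(date, float("-inf"))
        best[lab][date] = max(prev, elo)
    return {lab: sorted(points.items()) for lab, points in best.items()}
```

With this rule, a lab's curve stays pinned to its strongest model: if an older flagship still out-rates a newer release on a given date, the older rating is what gets plotted.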

Hacker News Comment Review

  • Key methodological caveat: apparent inter-release ELO decay is likely relative, not absolute. As stronger models enter the Arena population, older models lose more head-to-head matchups, pulling their ELO down even if raw capability is unchanged.
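A toy Elo simulation makes the relative-rating caveat concrete: an unchanged model's rating falls purely because a stronger entrant starts winning head-to-head votes against it. The K-factor and starting ratings here are illustrative assumptions, not Arena's actual parameters.

```python
# Standard Elo update: expected score from the rating gap, then a
# K-weighted correction toward the observed result.
K = 32

def expected(r_a: float, r_b: float) -> float:
    """Expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, score_a: float):
    """One match; score_a is 1.0 if A wins, 0.0 if A loses."""
    e = expected(r_a, r_b)
    return r_a + K * (score_a - e), r_b + K * ((1.0 - score_a) - (1.0 - e))

# An old model of fixed capability vs. a newcomer that beats it every time.
old, newcomer = 1300.0, 1300.0
for _ in range(50):
    old, newcomer = update(old, newcomer, 0.0)  # old model loses the vote
```

After the loop, `old` sits well below 1300 even though nothing about the old model changed, and the total rating is conserved (Elo with a shared K is zero-sum). This is exactly the mechanism that can masquerade as post-launch decay on the chart.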

Notable Comments

  • @underyx: “the slow performance decays are just more capable other models entering the population, making all prior models lose more frequently”
