Interactive chart tracking each major AI lab's flagship model's Elo rating over time, using daily-fetched LMSYS Arena leaderboard data to expose post-launch degradation and nerfs.
Key Takeaways
Data pulls daily from the official LM Arena Leaderboard Dataset on Hugging Face, sourced from thousands of blind crowdsourced human preference votes.
Each lab gets one curve that tracks its highest-Elo model at any moment, not just its most recently announced one, so Opus stays on top of Anthropic's curve if it still outscores Sonnet.
Inference-mode suffixes like -thinking, -reasoning, and -high are collapsed into one model to prevent curve flip-flopping.
Chart distinguishes two risk patterns: hard drops at release (new model underperforms predecessor) and slow inter-release decay visible as downward slope.
API benchmarks may not capture silent quantization or system-prompt wrappers added in consumer UIs like chatgpt.com or gemini.google.com.
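The flagship selection and suffix collapsing described in the takeaways above can be sketched in Python. This is a minimal illustration, not the page's actual pipeline: the model names, Elo values, and the exact suffix pattern are assumptions.

```python
import re
from collections import defaultdict

# Hypothetical leaderboard rows: (date, lab, model, elo).
# Values are illustrative, not real Arena data.
ROWS = [
    ("2025-01-01", "Anthropic", "claude-3-5-sonnet", 1270),
    ("2025-01-01", "Anthropic", "claude-3-opus", 1250),
    ("2025-02-01", "Anthropic", "claude-3-7-sonnet-thinking", 1300),
    ("2025-02-01", "Anthropic", "claude-3-7-sonnet", 1290),
]

# Inference-mode suffixes to collapse (assumed pattern)
SUFFIXES = re.compile(r"-(thinking|reasoning|high)$")

def normalize(model: str) -> str:
    """Collapse inference-mode variants into one base model name."""
    return SUFFIXES.sub("", model)

def flagship_curve(rows):
    """For each (lab, date), keep only the highest-Elo model
    after collapsing inference-mode variants."""
    best = defaultdict(lambda: (None, float("-inf")))
    for date, lab, model, elo in rows:
        key = (lab, date)
        if elo > best[key][1]:
            best[key] = (normalize(model), elo)
    return dict(best)

curve = flagship_curve(ROWS)
# On 2025-02-01 the -thinking variant wins but is folded into the base name,
# so the lab's curve has one point per date and no variant flip-flopping.
```

Normalizing before taking the per-date maximum is what prevents a `-thinking` variant from showing up as a separate curve that flips above and below its base model.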
Hacker News Comment Review
Key methodological caveat: apparent inter-release Elo decay is likely relative, not absolute. As stronger models enter the Arena population, older models lose more head-to-head matchups, pulling their Elo down even if raw capability is unchanged.
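This relative-decay mechanism can be illustrated with a toy online-Elo simulation (the real Arena fits a Bradley-Terry model to all votes; the skills, K-factor, and match counts here are made-up assumptions). A model whose true capability never changes still sees its rating drift down once a stronger entrant joins and starts taking points from the rest of the population.

```python
import random

K = 16  # Elo update step (assumed)

def expected(ra, rb):
    """Standard Elo expected score for a player rated ra vs rb."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def simulate(true_skills, ratings, n_matches, rng):
    """Play random pairings; outcomes come from fixed true skills,
    but ratings update zero-sum via the Elo rule."""
    names = list(true_skills)
    for _ in range(n_matches):
        a, b = rng.sample(names, 2)
        p_a = expected(true_skills[a], true_skills[b])  # capability is fixed
        a_wins = rng.random() < p_a
        ea = expected(ratings[a], ratings[b])
        ratings[a] += K * ((1 if a_wins else 0) - ea)
        ratings[b] += K * ((0 if a_wins else 1) - (1 - ea))

rng = random.Random(0)
true_skills = {"old_model": 1200, "peer": 1200}
ratings = {m: 1200.0 for m in true_skills}
simulate(true_skills, ratings, 2000, rng)
before = ratings["old_model"]

# A stronger entrant joins; old_model's true capability is unchanged.
true_skills["new_model"] = 1500
ratings["new_model"] = 1200.0
simulate(true_skills, ratings, 2000, rng)
after = ratings["old_model"]
# after < before: the old model's rating fell without any capability change.
```

Because Elo updates are zero-sum, every point the strong newcomer gains is drained from the incumbents, which is exactly the downward slope the caveat warns against over-interpreting.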
Notable Comments
@underyx: "the slow performance decays are just more capable other models entering the population, making all prior models lose more frequently"