How Intelligent Is AI, Really?
ARC Prize president Greg Kamradt explains why ARC-AGI became the standard AGI benchmark and what ARC-AGI v3’s game-based interactive test will reveal.
- GPT-4 scored 4% on ARC-AGI in 2024; OpenAI o1 jumped to 21% on release, signaling the reasoning paradigm shift.
- ARC-AGI 1 (2019) comprised 800 tasks, all built by François Chollet himself; ARC-AGI 2 launched in March 2025 as a harder static successor.
- ARC-AGI v3 (2026) uses ~150 interactive video-game environments with zero text instructions — models must infer the goal from actions and feedback.
- V3 will measure efficiency by action count: AI actions-to-win normalized against the average human actions-to-win, not wall-clock time (see the sketch after this list).
- OpenAI, xAI (Grok 4), Google (Gemini 3 Pro), and Anthropic (Claude Opus 4.5) now all report ARC-AGI scores in their model releases.
- Chollet’s position: solving ARC-AGI is necessary but not sufficient for AGI — v3 will be the most authoritative evidence of generalization to date.
- Kamradt flags RL-environment gaming as a false positive: tuning on specific RL setups wins benchmarks without achieving real generalization.
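For concreteness, here is a minimal sketch of the action-count efficiency metric described above, assuming the score is simply the average human action count divided by the AI's action count. The talk specifies only the normalization, not the exact formula, and the `EpisodeResult` and `efficiency_score` names here are illustrative, not part of any published ARC-AGI v3 API:

```python
# Hedged sketch of the v3 efficiency metric: AI actions-to-win
# normalized against the average human actions-to-win.
# All names (EpisodeResult, efficiency_score) are hypothetical.

from dataclasses import dataclass
from statistics import mean


@dataclass
class EpisodeResult:
    solved: bool        # did the agent reach the win condition?
    actions_taken: int  # total actions issued before winning


def efficiency_score(ai: EpisodeResult, human_action_counts: list[int]) -> float:
    """Return human-normalized action efficiency for one environment.

    A score of 1.0 means the AI used as many actions as the average
    human; above 1.0 means it was more action-efficient. Unsolved
    episodes score 0. The exact scoring shape is an assumption.
    """
    if not ai.solved:
        return 0.0
    human_avg = mean(human_action_counts)
    return human_avg / ai.actions_taken


# Example: humans average 40 actions; the AI wins in 55.
print(efficiency_score(EpisodeResult(True, 55), [32, 40, 48]))  # ~0.727
```

Normalizing by action count rather than wall-clock time keeps the metric independent of hardware and inference speed.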
2025-12-17 · Watch on YouTube