Some thoughts on the Sutton interview
Dwarkesh Patel reflects on his Sutton interview, arguing LLM imitation learning and RL are complementary steps toward AGI, not dead ends.
- Sutton’s bitter lesson critique: LLMs waste compute during deployment by not learning, and training leans on a finite, inelastic supply of human data.
- Dwarkesh’s counter: imitation learning is just short-horizon RL — one token per episode — not a categorically different paradigm.
- AlphaGo (human-bootstrapped) vs AlphaZero (scratch): both superhuman; human data isn’t detrimental, just not necessary at scale.
- Ilya Sutskever framed pre-training data as fossil fuels — a necessary, non-renewable intermediary to reach the next energy regime.
- LLMs RL’d on pre-trained priors now win gold at IMO and build full apps; you couldn’t bootstrap that RL from scratch yet.
- Continual learning gap is real: LLMs extract ~1 bit per episode from outcome-based RL, while animals extract high-bandwidth world-model updates.
- Dwarkesh speculates that exposing supervised fine-tuning as a tool call — with outer-loop RL teaching the model to update itself mid-task — could replicate continual learning.
- If LLMs reach AGI first, Dwarkesh expects successor systems built by those AGIs will be based on Sutton’s architecture vision.
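The "one token per episode" framing above can be written out as a gradient identity (my notation, not from the post): the behavior-cloning objective on human text has the same form as a policy gradient where each episode is a single token, the state $s$ is the prefix, the action $a^*$ is the human-chosen token, and the reward is a constant $1$.

```latex
% Imitation learning as degenerate one-step RL (sketch):
% D is the human text distribution; pi_theta the model's next-token policy.
\nabla_\theta J
  = \mathbb{E}_{(s,\,a^*) \sim D}
    \big[\, r \cdot \nabla_\theta \log \pi_\theta(a^* \mid s) \,\big],
\qquad r \equiv 1,\; T = 1,
```

which is exactly the cross-entropy gradient of next-token prediction. The substantive difference from full RL is that states and actions are sampled from $D$ rather than from the model's own rollouts, and the horizon is one step — a difference of degree, per Dwarkesh, not of paradigm.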
2025-10-04 · Watch on YouTube