Building Generative Image & Video Models at Scale - Sander Dieleman (Veo and Nano Banana)
Sander Dieleman (Google DeepMind, Veo/Nano Banana) explains every layer of large-scale diffusion model training, from latent compression to guidance and distillation.
- 30 seconds of 1080p video at 30fps is several gigabytes per training example; latent compression reduces tensor size by up to two orders of magnitude, making training feasible.
- Diffusion models are still smaller than frontier LLMs partly because classifier-free guidance lets them punch well above their parameter count in output quality.
- Dieleman frames diffusion as spectral autoregression: adding noise removes high frequencies first, so denoising naturally generates images coarse-to-fine, low-to-high frequency.
- Guidance amplifies the delta between conditional and unconditional predictions at each sampling step; turning it off would expose how weak current models' raw, unguided outputs actually are.
- Time spent improving data curation often outperforms tweaking the model or optimizer — still underrated and mostly unpublished because it is competitive secret sauce.
- Distillation in the diffusion context means fewer sampling steps (consistency models), not a smaller model — one-step consistency sampling rarely works well in practice.
- Model style and aesthetic opinions come mostly from post-training (RLHF/DPO), not guidance; guidance artifacts like oversaturation signal a guidance scale set too high.
- Google uses JAX with TPUs for model sharding; JAX was designed from the start to minimize chip-to-chip communication automatically.
2026-04-21 · Watch on YouTube