Nvidia CTO Michael Kagan: Scaling Beyond Moore's Law to Million-GPU Clusters

· hardware · Source ↗

Summary based on the YouTube transcript and episode description.

Nvidia CTO Michael Kagan explains how Mellanox networking—not just GPU compute—became the critical bottleneck and differentiator for AI clusters at scale.

  • AI model size has grown ~2x every 3 months since ~2011, demanding roughly 10x–16x annual performance gains versus Moore’s Law’s 2x every two years (the arithmetic is sketched after this list).
  • Network jitter, not raw bandwidth, determines how many GPUs a job can practically parallelize across; wide latency variance can force a job onto 10 GPUs instead of 1,000 (see the straggler sketch after this list).
  • At 100,000-GPU scale, the probability that some component fails approaches certainty; hardware and software must be designed to keep running through constant partial failures (see the failure-probability sketch below).
  • Inference compute demand now rivals or exceeds training: reasoning models run thousands of sequential inference steps per query, and a model is trained once but served billions of times.
  • Nvidia is building separate GPU SKUs optimized for the prefill (compute-intensive) and decode (memory-intensive) phases of inference (see the arithmetic-intensity sketch below).
  • xAI’s current large cluster runs at ~100–150 MW; the industry is now planning gigawatt and 10-gigawatt data centers, driving a full shift to liquid cooling.
  • Nvidia accelerated its product release cadence from every two years to every year, targeting roughly 10x performance improvement per generation.
  • Kagan argues AI could surface entirely unknown laws of physics by generalizing across observed phenomena the way theoretical physicists do, but at far greater scale.
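The growth-rate comparison in the first bullet reduces to simple exponent arithmetic. A minimal sketch, assuming doubling periods of 3 to 3.5 months for model size (bracketing the quoted 10x–16x per year) versus 24 months for Moore's Law:

```python
# Convert a doubling period (in months) into an annual growth factor.
# Doubling periods are the episode's approximate figures, not exact data.

def annual_growth(doubling_months: float) -> float:
    """Growth factor over 12 months given a doubling period in months."""
    return 2 ** (12 / doubling_months)

model_fast = annual_growth(3.0)    # 2x every 3 months    -> 16x per year
model_slow = annual_growth(3.5)    # 2x every ~3.5 months -> ~10.8x per year
moores_law = annual_growth(24.0)   # 2x every two years   -> ~1.41x per year

print(f"Model size per year:  {model_slow:.1f}x-{model_fast:.1f}x")
print(f"Moore's Law per year: {moores_law:.2f}x")
```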
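The jitter bullet is essentially a straggler argument: a synchronous collective step finishes only when the slowest GPU does, so step time is the maximum of N latency samples and grows with the width of the latency distribution. A toy Monte Carlo sketch, with an invented latency model chosen purely for illustration:

```python
# Toy straggler model: each synchronous step waits for the slowest of N GPUs,
# so step time = max of N latency samples. Wider jitter inflates that maximum
# as N grows. The exponential jitter model and all numbers are illustrative.
import random

def mean_step_time(n_gpus: int, jitter: float, trials: int = 1000) -> float:
    base = 1.0  # nominal per-step latency (arbitrary units)
    total = 0.0
    for _ in range(trials):
        total += max(base + random.expovariate(1.0 / jitter) for _ in range(n_gpus))
    return total / trials

for n in (10, 1_000):
    for jitter in (0.01, 0.5):  # tight vs. wide latency variance
        print(f"{n:>5} GPUs, jitter={jitter:<4} -> mean sync step ~{mean_step_time(n, jitter):.2f}")
```

With tight jitter the step time stays near the base latency even at 1,000 GPUs; with wide jitter the 1,000-GPU step is several times slower, which is the practical pressure to split the job across far fewer GPUs.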
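The failure bullet follows from basic probability: if each GPU independently fails with small probability p over some window, the chance that at least one of N fails is 1 - (1 - p)^N. A quick sketch with an assumed, purely illustrative p:

```python
# P(at least one failure among N components) = 1 - (1 - p)^N.
# p below is an assumed per-GPU failure probability for an arbitrary window.

def p_any_failure(n_components: int, p_single: float) -> float:
    return 1.0 - (1.0 - p_single) ** n_components

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} GPUs -> P(at least one failure) = {p_any_failure(n, 1e-4):.4f}")
```

At 100,000 GPUs even a 0.01% per-GPU failure probability makes at least one failure all but certain, which is why the system has to treat partial failure as the normal operating condition.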
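The prefill/decode split comes down to arithmetic intensity. A back-of-the-envelope sketch for one dense layer of a generic transformer; the dimensions are assumptions for illustration and do not describe any actual Nvidia SKU:

```python
# FLOPs per byte of weights moved for a single dense layer, prefill vs. decode.
# All sizes are assumed, illustrative values.

hidden = 8192            # model hidden dimension (assumed)
bytes_per_param = 2      # fp16/bf16 weights
prompt_tokens = 4096     # tokens processed together during prefill (assumed)

weight_bytes = hidden * hidden * bytes_per_param   # one layer's weight traffic
flops_per_token = 2 * hidden * hidden              # one matmul: multiply + add

# Prefill: the whole prompt shares one pass over the weights -> compute-bound.
prefill_intensity = prompt_tokens * flops_per_token / weight_bytes

# Decode: every generated token re-reads the full weights -> bandwidth-bound.
decode_intensity = flops_per_token / weight_bytes

print(f"Prefill: ~{prefill_intensity:,.0f} FLOPs per weight byte")
print(f"Decode:  ~{decode_intensity:,.0f} FLOPs per weight byte")
```

The orders-of-magnitude gap in FLOPs per byte is the rationale for pairing a compute-heavy part with a memory-bandwidth-heavy part.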

2025-10-28 · Watch on YouTube