Decoupled DiLoCo: Resilient, Distributed AI Training at Scale


TLDR

  • A Google paper introduces Decoupled DiLoCo, which trains LLMs across asynchronous compute islands with orders-of-magnitude lower inter-site bandwidth and built-in fault tolerance.

Key Takeaways

  • Combines Pathways (asynchronous data flow) and DiLoCo (low-bandwidth distributed training) so hardware failures in one compute island do not stall others.
  • Trained a 12B parameter Gemma 4 model across four US regions over 2–5 Gbps WAN links (standard datacenter connectivity, no custom fiber required).
  • Achieves 20x faster wall-clock training than conventional synchronous methods by overlapping communication with longer computation windows, eliminating blocking bottlenecks.
  • Supports mixed hardware generations (TPU v6e and TPU v5p) in a single run, matching single-chip-type ML benchmark performance while extending older hardware utility.
  • Chaos engineering tests confirmed the system maintains high goodput through full learner-unit loss and seamless reintegration on recovery.
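
The bandwidth and overlap claims above follow from DiLoCo's two-level structure: each island runs many local optimizer steps on its own, and islands exchange only a parameter delta once per outer round. Below is a minimal, hypothetical sketch of that pattern on a toy objective — it is not the paper's implementation, and the function names, step counts, and the plain-momentum outer update are illustrative assumptions (the DiLoCo family uses a Nesterov-momentum outer optimizer):

```python
# Hypothetical sketch of DiLoCo-style two-level optimization (illustration
# only, not the paper's code). Islands take many cheap local steps; only a
# small delta crosses the WAN each outer round, which is why cross-region
# bandwidth needs drop by orders of magnitude.
import numpy as np

def local_steps(params, grad_fn, lr=0.1, H=50):
    """H inner SGD steps, entirely inside one island (no WAN traffic)."""
    p = params.copy()
    for _ in range(H):
        p -= lr * grad_fn(p)
    return p

def outer_step(theta, island_params, buf, outer_lr=0.5, beta=0.6):
    """Average per-island deltas, then one momentum-style outer update.
    (Plain momentum here for brevity; DiLoCo uses Nesterov momentum.)"""
    delta = np.mean([theta - p for p in island_params], axis=0)
    buf = beta * buf + delta
    return theta - outer_lr * buf, buf

# Toy per-island objective: minimize ||x - target||^2.
target = np.array([1.0, -2.0, 3.0])
grad_fn = lambda p: 2.0 * (p - target)

theta = np.zeros(3)
buf = np.zeros(3)
for _ in range(10):                                # 10 outer rounds,
    islands = [local_steps(theta, grad_fn)         # each hiding H=50
               for _ in range(4)]                  # local steps x 4 regions
    theta, buf = outer_step(theta, islands, buf)

print(np.linalg.norm(theta - target))              # distance to optimum shrinks
```

Because each outer round ships only one delta per island, a failed island can be dropped from that round's average and reinserted when it recovers — the behavior the chaos-engineering tests exercise.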

Hacker News Comment Review

  • Discussion is thin; the main skeptical thread questions whether the core idea is genuinely novel, given that island-style distributed compute is well established outside AI. The real claim to novelty is the algorithmic adaptation for LLM pre-training and the proof of execution at production scale.

Notable Comments

  • @SilverElfin: Acknowledges engineering effort but asks whether combining distant compute clusters is novel, or just “done many times before for non-AI things.”

Original | Discuss on HN