Google’s new TorchTPU stack lets developers run PyTorch on TPUs by changing a single line of initialization code, with XLA and StableHLO doing the heavy lifting under the hood.
Key Takeaways
- Three eager execution modes: Debug (synchronous, slow), Strict (asynchronous), and Fused Eager, which automatically fuses streams of operations for throughput gains of 50-100% or more with no user setup.
- XLA replaces Torch Inductor as the compiler backend; PyTorch FX graphs are translated to StableHLO IR, giving XLA a global view of distributed collective communication across the ICI torus.
- TorchTPU fixes a core PyTorch/XLA limitation: MPMD (multiple program, multiple data) support lets rank 0 run divergent code (logging, analytics) without breaking XLA’s global-view optimization.
- Custom kernels written in Pallas and JAX work natively through a decorator; Helion DSL support is in progress.
- Hardware-aware tip: models that hardcode attention head dimensions at 64 leave performance on the table; TPU TensorCores peak at dimensions of 128 or 256.
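The fused-eager dispatch idea in the first takeaway can be sketched in plain Python. This is a conceptual illustration only: the `FusedEagerStream` class and its methods are invented for this sketch and are not TorchTPU APIs. The point is the dispatch model, where ops are queued asynchronously and executed together in one fused launch instead of one launch per op.

```python
# Conceptual sketch of fused-eager dispatch (NOT TorchTPU code):
# queue ops as they arrive, then execute the whole stream in one "launch".
from typing import Callable, List

class FusedEagerStream:
    """Records ops instead of running them, then flushes them as one batch."""

    def __init__(self) -> None:
        self.pending: List[Callable[[], None]] = []
        self.flushes = 0  # each flush stands in for one fused device launch

    def run(self, op: Callable[[], None]) -> None:
        # Asynchronous: record the op; nothing executes yet.
        self.pending.append(op)

    def flush(self) -> None:
        # One "launch" executes the entire fused op stream.
        for op in self.pending:
            op()
        self.pending.clear()
        self.flushes += 1

results: List[int] = []
stream = FusedEagerStream()
for i in range(4):
    stream.run(lambda i=i: results.append(i * i))
stream.flush()

print(results)         # [0, 1, 4, 9]
print(stream.flushes)  # 1 fused launch instead of 4 separate ones
```

A Debug-style mode in this toy model would simply execute each op inside `run` and skip the queue, trading the fusion win for step-by-step inspectability.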
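The MPMD takeaway describes rank 0 taking a divergent branch while the other ranks follow the shared program. The pattern can be simulated in-process, with no real collective library; `rank_program` and everything in it are hypothetical names for this sketch, not TorchTPU or PyTorch APIs.

```python
# Sketch of the MPMD pattern: all ranks do the shared work, but rank 0 also
# runs divergent analytics/logging. Ranks are simulated in one process.
from typing import List

def rank_program(rank: int, world_size: int, logs: List[str]) -> int:
    # Shared, SPMD-style work: each rank reduces its own shard.
    shard_sum = sum(range(rank * 4, rank * 4 + 4))
    if rank == 0:
        # Divergent branch: a strict single-program (SPMD) view struggles
        # with this control-flow split; MPMD support lets rank 0 take it.
        logs.append(f"rank 0 analytics: world_size={world_size}")
    return shard_sum

logs: List[str] = []
partials = [rank_program(r, 4, logs) for r in range(4)]

print(sum(partials))  # 120: the shared reduction is unaffected
print(logs)           # only rank 0 produced a log entry
```

The key property illustrated: the divergent branch changes what rank 0 does locally without changing the collective result the other ranks depend on.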
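The head-dimension tip can be checked with back-of-envelope arithmetic, assuming the 128x128 systolic array (MXU) inside a TPU TensorCore: a 64-wide matmul occupies only half of a tile's lanes, while 128 and 256 fill their tiles exactly.

```python
# Back-of-envelope MXU utilization for different attention head dimensions,
# assuming a 128-wide systolic array tile. Illustrative model only.
import math

MXU_WIDTH = 128  # TPU MXU tile width

def mxu_utilization(head_dim: int) -> float:
    # Fraction of lanes doing useful work in the tiles the matmul occupies.
    tiles = math.ceil(head_dim / MXU_WIDTH)
    return head_dim / (tiles * MXU_WIDTH)

for head_dim in (64, 128, 256):
    print(head_dim, mxu_utilization(head_dim))  # 64 -> 0.5; 128, 256 -> 1.0
```

Under this simple model, a hardcoded head dimension of 64 idles half the array per tile, which is the performance left on the table.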