TorchTPU: Running PyTorch Natively on TPUs at Google Scale

· ai systems coding ·

TLDR

  • Google’s new TorchTPU stack lets developers run PyTorch on TPUs by changing one line of init code, using XLA and StableHLO under the hood.

Key Takeaways

  • Three eager execution modes: Debug (sync, slow), Strict (async), and Fused Eager, which auto-fuses op streams for 50-100%+ throughput gains with no user setup.
  • XLA replaces Torch Inductor as the compiler backend; PyTorch FX graphs translate to StableHLO IR, giving XLA a global view of distributed collective communication across the ICI torus.
  • TorchTPU fixes a core PyTorch/XLA limitation: MPMD support lets rank 0 run divergent code (logging, analytics) without breaking XLA’s global-view optimization.
  • Custom kernels via Pallas and JAX work natively through a decorator; Helion DSL support is in progress.
  • Hardware-aware tip: models hardcoding attention head dims at 64 leave performance on the table; TPU TensorCores peak at 128 or 256.
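The head-dim tip above is a matter of tile alignment: a matmul whose contraction dimension is smaller than the MXU's lane width gets padded up, and the padding lanes do no useful work. A minimal sketch of that arithmetic, assuming a 128-wide systolic array (a common TPU MXU shape; actual tile geometry varies by TPU generation, and `mxu_utilization` is an illustrative helper, not a TorchTPU API):

```python
# Illustrative sketch (assumed 128-lane MXU): fraction of lanes doing
# useful work when a dimension is padded up to the next tile multiple.

def mxu_utilization(head_dim: int, lane_width: int = 128) -> float:
    """Useful-work fraction after padding head_dim to a multiple of
    lane_width (ceil division via the -(-a // b) idiom)."""
    padded = -(-head_dim // lane_width) * lane_width
    return head_dim / padded

print(mxu_utilization(64))   # pads 64 -> 128: half the lanes idle (0.5)
print(mxu_utilization(128))  # exact fit (1.0)
print(mxu_utilization(256))  # two full tiles, still exact (1.0)
```

This is why the article flags head dims hardcoded at 64: on the assumed geometry they waste half the MXU, while 128 or 256 fill the tiles exactly.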

Hacker News Comment Review

  • No substantive HN discussion yet.
