Supercomputer networking to accelerate large-scale AI training

· ai systems

TLDR

  • OpenAI releases MRC (Multipath Reliable Connection) via OCP, a RoCE-extending protocol using packet spraying and SRv6 source routing to harden large-scale GPU training networks.

Key Takeaways

  • MRC splits each 800Gb/s interface into multiple 100Gb/s planes, enabling a two-tier switch topology that connects 131,000+ GPUs, rather than the three or four tiers otherwise required.
  • Adaptive packet spraying distributes a single transfer across hundreds of paths simultaneously, eliminating the hot spots that stall synchronous pretraining collectives.
  • SRv6 static source routing replaces BGP entirely; switches follow fixed tables, removing whole classes of dynamic routing failures.
  • Packet trimming lets congested switches forward headers only, triggering explicit retransmission and preventing false path-failure signals.
  • Already deployed on OpenAI’s NVIDIA GB200 clusters at OCI Abilene and Microsoft Fairwater; used to train multiple frontier models, including those behind ChatGPT and Codex.
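
The packet-spraying idea above can be sketched with a toy model: packets of one transfer are scattered across many equal-cost paths, weighted away from congested queues. This is an illustrative simulation, not MRC's actual algorithm; the weighting scheme and topology are assumptions.

```python
import random
from collections import Counter

def spray_packets(num_packets, path_depths, rng=None):
    """Toy adaptive packet spraying (illustrative; not MRC's real logic).

    Each packet of a single transfer picks one of many equal-cost paths,
    with probability inversely proportional to that path's current queue
    depth, so no single path becomes a hot spot.
    """
    rng = rng or random.Random(0)
    # Weight each path by the inverse of its queue depth (deeper = avoided).
    weights = [1.0 / (1 + depth) for depth in path_depths]
    assignment = Counter()
    for _ in range(num_packets):
        path = rng.choices(range(len(path_depths)), weights=weights)[0]
        assignment[path] += 1
    return assignment

# A transfer of 10,000 packets over 16 paths, one of them congested:
depths = [0] * 15 + [50]            # path 15 has a deep queue
spread = spray_packets(10_000, depths)
```

In contrast to flow-level ECMP hashing, which pins an entire transfer to one path, every packet here makes an independent choice, so the congested path receives almost no traffic while the rest share the load evenly.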
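
The SRv6 point can likewise be sketched: the sender encodes the whole path as a segment list in the packet, and each switch does only a static table lookup. Names and topology below are hypothetical, and this is a schematic of source routing in general, not MRC's wire format.

```python
def source_route(segment_list, static_links):
    """Toy SRv6-style static source routing (illustrative sketch).

    The sender writes the entire path into the packet as a segment list.
    Each switch only checks the next segment against a fixed table, so no
    dynamic routing protocol (e.g. BGP) runs inside the fabric.
    """
    hops = [segment_list[0]]
    for nxt in segment_list[1:]:
        cur = hops[-1]
        # A static table lookup replaces any dynamic route computation.
        if nxt not in static_links.get(cur, ()):
            raise ValueError(f"no static link {cur} -> {nxt}")
        hops.append(nxt)
    return hops

# Hypothetical two-tier fabric: leaf -> spine -> leaf -> GPU host.
links = {
    "leaf1": {"spine1", "spine2"},
    "spine1": {"leaf1", "leaf7"},
    "spine2": {"leaf1", "leaf7"},
    "leaf7": {"gpu42"},
}
path = source_route(["leaf1", "spine2", "leaf7", "gpu42"], links)
```

Because the tables are fixed, the failure modes of route convergence and flapping simply cannot occur; a bad path shows up as an immediate, explicit error rather than a transient routing event.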
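
Finally, packet trimming can be modeled as a switch that, when its buffer is full, forwards only the packet header instead of silently dropping the packet, so the receiver knows exactly which payloads to request again. This is a minimal sketch under assumed buffer and retransmit semantics, not MRC's actual mechanism.

```python
def forward_through_switch(packets, buffer_slots):
    """Toy packet-trimming switch (illustrative; not MRC's wire format).

    Packets that fit in the egress buffer are delivered whole; for the
    rest, only the header (the sequence number) survives, marked trimmed.
    """
    delivered, trimmed = [], []
    for seq, payload in packets:
        if len(delivered) < buffer_slots:
            delivered.append((seq, payload))
        else:
            trimmed.append(seq)  # header forwarded, payload cut
    return delivered, trimmed

def transfer(payloads, buffer_slots):
    """Send payloads; explicitly retransmit whatever comes back trimmed."""
    received = {}
    pending = list(enumerate(payloads))
    while pending:
        delivered, trimmed = forward_through_switch(pending, buffer_slots)
        for seq, payload in delivered:
            received[seq] = payload
        # Trimmed headers name the exact packets to resend -- no timeouts,
        # and no mistaking congestion loss for a failed path.
        pending = [(seq, payloads[seq]) for seq in trimmed]
    return [received[i] for i in range(len(payloads))]

msgs = [f"chunk-{i}" for i in range(10)]
result = transfer(msgs, buffer_slots=4)
```

The key property is that loss becomes a signal rather than silence: the sender never has to guess between a congested queue and a dead link.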

Hacker News Comment Review

  • No substantive HN discussion yet.
