OpenAI releases MRC (Multipath Reliable Connection) via OCP, a RoCE-extending protocol using packet spraying and SRv6 source routing to harden large-scale GPU training networks.
Key Takeaways
MRC splits 800Gb/s interfaces into multiple 100Gb/s planes, enabling a two-tier switch topology connecting 131,000+ GPUs instead of three or four tiers.
Adaptive packet spraying distributes a single transfer across hundreds of paths simultaneously, eliminating hot-spots that stall synchronous pretraining collectives.
Already deployed on OpenAI’s NVIDIA GB200 clusters at OCI Abilene and Microsoft Fairwater; used to train multiple frontier models including ChatGPT and Codex.