AI · Neutral · arXiv – CS AI · 5h ago · 6/10
Resilient AI Supercomputer Networking using MRC and SRv6
OpenAI and Microsoft have deployed MRC, a new RDMA-based transport protocol, together with SRv6 static routing to reduce tail latency in AI training clusters of more than 100K GPUs. The system uses multi-plane Clos topologies and load balancing to route around network failures without interrupting synchronous training jobs, which addresses a critical bottleneck in frontier model development.
🏢 OpenAI