From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs
A production analysis of a 504-GPU NVIDIA B200 cluster reveals that large-scale AI training requires multi-signal failure detection strategies, with a 100% detection rate achieved through statistical analysis of 751 metrics. The study identifies storage I/O bottlenecks invisible at smaller scales and shows auto-retry mechanisms succeed 2.7x more often than manual recovery, providing critical operational insights for distributed AI infrastructure.
This technical report addresses a critical gap in public operational data from production AI clusters, presenting empirical findings from a cross-organizational 504-GPU facility operated jointly by SKT, Upstage, Lablup, NVIDIA Korea, and VAST Data. The collaborative infrastructure enabled detection of system-level phenomena that individual organizations could not isolate independently, particularly a 60-node storage I/O bottleneck demonstrating how distributed systems failures emerge only at scale.
The research establishes that GPU failure detection cannot rely on a single metric; instead, statistical analysis across 751 Prometheus metrics achieved a perfect detection rate while holding false positives to 0.84 per day. This finding challenges the assumption that one dominant indicator is sufficient and validates the multi-signal approach essential for reliable large-scale training. The "bandwidth paradox" (200 Gbps RoCE networks operating at only 1.4-10.4% utilization) traces to NFS RPC layer saturation at 128 slots, revealing an infrastructure bottleneck orthogonal to network hardware specifications.
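To make the multi-signal idea concrete, the sketch below flags a node only when several independent metrics deviate together, rather than alerting on any single threshold breach. It is a minimal illustration, not the report's pipeline: the metric names, the robust z-score, and the k-of-n voting rule are all assumptions.

```python
import numpy as np

def robust_zscores(series: np.ndarray) -> np.ndarray:
    """Median/MAD z-scores: less sensitive to the very spikes being detected."""
    median = np.median(series)
    mad = np.median(np.abs(series - median))
    mad = mad if mad > 0 else 1e-9
    return 0.6745 * (series - median) / mad

def node_is_suspect(metrics: dict[str, np.ndarray],
                    z_threshold: float = 6.0,
                    min_votes: int = 3) -> bool:
    """Flag a node only if at least `min_votes` metrics show an extreme
    deviation in their latest sample (k-of-n voting across signals)."""
    votes = sum(
        1 for series in metrics.values()
        if abs(robust_zscores(series)[-1]) >= z_threshold
    )
    return votes >= min_votes

# Hypothetical per-node time series scraped from Prometheus (names are illustrative).
node_metrics = {
    "gpu_xid_error_count":   np.array([0, 0, 0, 0, 5]),
    "nvlink_replay_errors":  np.array([1, 0, 2, 1, 40]),
    "gpu_sm_clock_mhz":      np.array([1980, 1975, 1982, 1978, 600]),
    "ecc_uncorrected_total": np.array([0, 0, 0, 0, 3]),
}
print(node_is_suspect(node_metrics))  # True: several signals deviate together
```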
Operational recovery patterns show a concentrated failure distribution (the top 3 nodes account for >50% of node exclusions), and automated retry mechanisms outperform manual intervention by 2.7x, succeeding 33.3% of the time versus 12.5%. These findings directly impact AI infrastructure design and operational protocols across the industry: organizations building large-scale training clusters must architect unified observability pipelines, implement multi-signal detection systems, and prioritize automated recovery mechanisms. The median 11-minute retry interval offers a starting point for timeout configurations that minimize training disruption while balancing resource utilization.
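The retry pattern these numbers point to can be sketched as a small wrapper around job submission: fail fast, wait roughly the observed median interval before resubmitting, and cap the number of attempts. Everything below (the `submit_job` callable, the jittered wait, the attempt cap) is an illustrative assumption, not the cluster's actual scheduler logic.

```python
import random
import time

RETRY_BASE_SECONDS = 11 * 60   # ~median retry interval reported in the study
MAX_ATTEMPTS = 4

def run_with_auto_retry(submit_job, job_spec) -> bool:
    """Resubmit a failed training job a bounded number of times.

    `submit_job(job_spec)` is a hypothetical callable that returns True on a
    clean exit and raises (or returns False) on failure.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            if submit_job(job_spec):
                return True
        except Exception as exc:
            print(f"attempt {attempt} failed: {exc}")
        if attempt < MAX_ATTEMPTS:
            # Jitter the wait around the base interval to avoid a thundering
            # herd of resubmissions when many jobs fail at once.
            time.sleep(RETRY_BASE_SECONDS * (1 + random.uniform(-0.25, 0.25)))
    return False
```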
- Perfect GPU failure detection requires multi-signal statistical analysis across hundreds of metrics rather than reliance on single dominant indicators
- Storage I/O bottlenecks manifest only at 60+ node scale, creating critical infrastructure challenges invisible to smaller deployments
- Automated retry mechanisms achieve 2.7x higher recovery success rates than manual intervention in production AI clusters
- Cross-organizational monitoring pipelines enable detection of distributed-system phenomena impossible for individual teams to diagnose
- NFS RPC layer saturation, not network bandwidth, constrains checkpoint performance in large-scale GPU clusters (see the client-side slot check sketched below)
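On Linux NFS clients, the RPC slot limit referenced above is exposed through the sunrpc sysctls. A quick check like the sketch below (assuming a Linux client with the sunrpc module loaded; the 128-slot figure itself comes from the report) shows whether the client-side slot table, rather than the 200 Gbps fabric, is the ceiling on concurrent checkpoint I/O.

```python
from pathlib import Path

# sunrpc tunables exposed by the Linux NFS client (present when the module is loaded).
SUNRPC_SYSCTLS = [
    "/proc/sys/sunrpc/tcp_slot_table_entries",
    "/proc/sys/sunrpc/tcp_max_slot_table_entries",
]

for path in SUNRPC_SYSCTLS:
    p = Path(path)
    value = p.read_text().strip() if p.exists() else "n/a (sunrpc not loaded?)"
    print(f"{path}: {value}")
```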