Pretraining large language models with MXFP4
Researchers identify weight gradient (Wgrad) quantization as the primary cause of instability when pretraining large language models in MXFP4 (micro-scaled 4-bit floating point), while quantizing the forward and activation-gradient paths proves relatively benign. Using deterministic Hadamard rotations on AMD MI355X GPUs, they show that structured micro-scaling errors, not insufficient randomness, drive training divergence, offering practical guidance for efficient LLM pretraining.
This research addresses a critical bottleneck in efficient large language model training: full-pipeline FP4 (4-bit floating point) quantization consistently diverges even though quantizing the forward and activation-gradient computations on their own remains stable. The study's controlled ablations isolate which components of the training pipeline actually cause convergence problems, revealing that weight-gradient quantization bears primary responsibility while forward-pass and activation-gradient quantization introduce only modest accuracy loss. This finding challenges conventional assumptions about where quantization error matters most.
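The "micro-scaling" in MXFP4 refers to the OCP MX block format, in which small groups of values share a single power-of-two scale while each value is stored as 4-bit E2M1. The paper's kernels are not reproduced here, but a minimal NumPy sketch of MX-style FP4 quantization (the block size, scale rule, and helper name are illustrative assumptions, not the authors' implementation) shows where the block-level rounding error discussed above enters:

```python
import numpy as np

# E2M1 (FP4) representable magnitudes: sign * {0, 0.5, 1, 1.5, 2, 3, 4, 6}
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4(x, block_size=32):
    """Illustrative MX-style FP4 quantizer: each block of `block_size`
    values shares one power-of-two scale, and each value is rounded to
    the nearest representable E2M1 magnitude."""
    x = np.asarray(x, dtype=np.float32)
    pad = (-len(x)) % block_size
    xp = np.pad(x, (0, pad)).reshape(-1, block_size)

    # Per-block power-of-two scale chosen so the block maximum lands near
    # the largest FP4 magnitude (6.0). This shared scale is the
    # "micro-scaling" part of the format; its coarseness is one source of
    # the structured error the paper attributes to the Wgrad path.
    amax = np.abs(xp).max(axis=1, keepdims=True)
    amax = np.where(amax == 0, 1.0, amax)
    scale = 2.0 ** np.floor(np.log2(amax / 6.0))

    # Round-to-nearest onto the FP4 grid, then rescale back.
    scaled = xp / scale
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return (q * scale).reshape(-1)[:len(x)]
```

A single outlier in a block forces a large shared scale, crushing the other 31 values toward zero; that is the kind of structured, non-random error the study points to.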
The key enabler is native MXFP4 hardware support on AMD Instinct MI355X GPUs, which permits precise empirical investigation without software-emulation artifacts. Previous attempts to stabilize FP4 training relied on stochastic techniques such as stochastic rounding and randomized transformations, on the intuition that added randomness should smooth gradient noise. The research demonstrates that these approaches fail because the core problem is not a lack of randomness but systematic micro-scaling errors that accumulate along the gradient-computation path. Deterministic Hadamard rotations restore stability by structurally correcting these errors.
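The paper's exact rotation kernel is not shown here, but the underlying idea can be sketched as follows, reusing the hypothetical quantize_mxfp4 helper above and assuming the token count is a power of two. Because a normalized Hadamard matrix H satisfies H.T @ H = I, inserting it into the weight-gradient GEMM leaves the exact result unchanged while spreading outliers across each micro-scaled block before quantization:

```python
import numpy as np

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n assumed to be
    a power of two), normalized so that H.T @ H == I."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def wgrad_fp4_hadamard(x, dy, block_size=32):
    """Illustrative weight-gradient GEMM with both operands rotated by a
    deterministic Hadamard transform along the reduction (token) axis
    before FP4 quantization. In exact arithmetic the rotation cancels,
    since dy.T @ x == (H @ dy).T @ (H @ x); under MXFP4 it flattens
    outliers so each block's shared scale fits the data better."""
    n_tokens = x.shape[0]
    H = hadamard(n_tokens)
    xr, dyr = H @ x, H @ dy            # rotate along the reduction axis
    xq = quantize_mxfp4(xr.ravel(), block_size).reshape(xr.shape)
    dyq = quantize_mxfp4(dyr.ravel(), block_size).reshape(dyr.shape)
    return dyq.T @ xq                  # approximates dW = dy.T @ x
```

The rotation is deterministic and structure-aware, which is exactly the contrast the study draws with stochastic rounding: it reshapes the error rather than randomizing it.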
For the AI infrastructure industry, the result has immediate implications for training efficiency. Identifying Wgrad as the critical failure point enables targeted optimization rather than blanket quantization across all pipeline components. Developers can deploy FP4 in the forward and activation-gradient paths with minimal accuracy loss while concentrating engineering effort on robust Wgrad handling. This selective quantization approach substantially reduces memory-bandwidth and compute requirements.
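As an illustration of what such a selective policy looks like at the layer level, each linear layer involves three GEMMs that can be assigned precisions independently. The field names and defaults below are assumptions for exposition, not the paper's API:

```python
from dataclasses import dataclass

@dataclass
class LinearGemmPrecision:
    """Hypothetical per-GEMM precision policy for one linear layer,
    reflecting the selective scheme described above: forward and
    activation-gradient GEMMs run directly in MXFP4, while the
    weight-gradient GEMM either stays in higher precision or gets
    Hadamard-rotated MXFP4 treatment."""
    fprop: str = "mxfp4"          # y  = x @ W.T
    dgrad: str = "mxfp4"          # dx = dy @ W
    wgrad: str = "mxfp4+hadamard"  # dW = dy.T @ x (or bf16 as a fallback)
```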
Looking forward, the deterministic correction methodology may generalize to other quantization schemes and mixed-precision training paradigms. The validation on production-grade AMD hardware suggests manufacturers can provide native FP4 support with confidence, potentially accelerating adoption across data centers seeking cost-effective large model training.
- Weight gradient quantization drives FP4 training instability, not forward or activation-gradient quantization
- Deterministic Hadamard rotations reliably stabilize training where stochastic approaches consistently fail
- Structured micro-scaling errors along gradient paths, rather than insufficient randomness, cause convergence degradation
- Native FP4 support on AMD MI355X GPUs enables practical investigation without software-emulation limitations
- Selective FP4 deployment that reserves special handling for weight gradients can achieve significant efficiency gains with minimal accuracy loss