Pretraining large language models with MXFP4
Researchers identify weight gradient (Wgrad) quantization as the primary cause of instability when training large language models in FP4, while quantization of the forward pass and of activation gradients proves relatively benign. Using deterministic Hadamard rotations on AMD MI355X GPUs, they show that structured micro-scaling errors, rather than insufficient randomness, drive training divergence, a finding that can guide more efficient LLM pretraining.
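To make the two ingredients named above concrete, here is a minimal sketch of MXFP4-style block quantization combined with a deterministic Hadamard rotation. It is an illustration under stated assumptions, not the researchers' kernel: the block size of 32, the E2M1 value grid, and the power-of-two scale rule follow the common micro-scaling convention, and the function names (`mxfp4_block`, `hadamard_rotate`, `quantize_rotated`) are hypothetical. The tie-breaking in rounding is simplified.

```python
import math

# Magnitudes representable by an FP4 E2M1 element (sign is handled separately).
FP4_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 32  # assumed block size: 32 elements share one power-of-two scale


def mxfp4_block(block):
    """Fake-quantize one block: shared power-of-two scale, FP4 elements."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # floor(log2(amax)), computed exactly via frexp (amax = m * 2**e, 0.5 <= m < 1).
    exp = math.frexp(amax)[1] - 1
    # Assumed scale rule: subtract 2, the exponent of the largest E2M1 value (6 = 1.5 * 2**2).
    scale = math.ldexp(1.0, exp - 2)
    out = []
    for v in block:
        mag = abs(v) / scale
        # Nearest grid point; magnitudes above 6 saturate to 6.
        # Ties resolve to the smaller magnitude (a simplification of real rounding).
        q = min(FP4_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(q * scale, v))
    return out


def hadamard_rotate(block):
    """Orthonormal Walsh-Hadamard transform; length must be a power of two.

    With the 1/sqrt(n) normalization the transform is its own inverse.
    """
    v = list(block)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    norm = math.sqrt(n)
    return [x / norm for x in v]


def quantize_rotated(x):
    """Rotate each block, quantize it, then rotate back.

    Assumes len(x) is a multiple of BLOCK.
    """
    out = []
    for start in range(0, len(x), BLOCK):
        rotated = hadamard_rotate(x[start:start + BLOCK])
        out.extend(hadamard_rotate(mxfp4_block(rotated)))
    return out
```

The rotation is fixed rather than random, so any change in quantization error comes from how it redistributes values within a block: a single outlier is spread evenly across all 32 positions before the shared scale is chosen. Comparing `mxfp4_block(g)` against `quantize_rotated(g)` on a gradient-like vector `g` is one way to inspect that effect; whether the rotation helps depends on the data.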