Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe
Researchers identify a fundamental flaw in current FP4 training approaches for large language models: E2M1 formats suffer from systematic "Shrinkage Bias" that degrades training stability. They propose UFP4, a uniform 4-bit recipe using E1M2/INT4 grids that outperforms existing E2M1 baselines across multiple model scales, suggesting future AI accelerators should prioritize uniform grid formats for training.
This technical research addresses a critical pain point in LLM pretraining infrastructure. Current approaches to reducing computational costs through 4-bit floating-point (FP4) training rely on non-uniform number formats such as E2M1, which is implemented in leading hardware including NVIDIA's Blackwell and AMD's MI350 GPUs. The authors demonstrate that this choice introduces a systematic bias: rounding errors accumulate multiplicatively through network layers, particularly when E2M1 is combined with the Random Hadamard Transform, a technique used to improve quantization quality.
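To make "accumulate multiplicatively" concrete, here is a toy sketch in Python. It assumes, purely for illustration, that every layer scales the signal by the same factor `1 - per_layer_shrink`; the function name, the constant-factor model, and the 1% figure are hypothetical and are not taken from the paper.

```python
def compounded_gain(per_layer_shrink, num_layers):
    """Scale left after num_layers layers that each multiply by (1 - per_layer_shrink).

    Toy model only: a real network does not shrink by one constant factor per layer.
    """
    return (1.0 - per_layer_shrink) ** num_layers


# Hypothetical 1% shrink per layer, shown at a few depths.
for depth in (1, 16, 64):
    print(depth, compounded_gain(0.01, depth))
```

Under this assumption a per-layer effect that looks negligible in isolation grows geometrically with depth, which is the sense in which a small systematic bias can matter more than zero-mean rounding noise of similar size.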
The research traces this problem to a fundamental geometric asymmetry in how non-uniform formats distribute representable numbers. By contrast, uniform grids such as E1M2 and INT4 space their values evenly, which removes the inherent directional bias. The difference is magnified in deep networks, where errors compound across layers, and this explains previously documented training instability.
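One way to see the asymmetry is to compare each grid value with the centre of the interval of inputs that round to it. The Python sketch below does this for the standard E2M1 magnitudes and for an evenly spaced 8-level grid standing in for E1M2/INT4. The helper names and the unit step of the uniform grid are assumptions made for illustration, and the offset measure is a simple diagnostic, not necessarily the derivation used in the paper.

```python
# Non-negative magnitudes of each 4-bit format (the sign bit is handled separately).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
# Evenly spaced stand-in for E1M2 / sign-magnitude INT4 (unit step is an assumption).
UNIFORM_GRID = [float(k) for k in range(8)]


def round_to_nearest(x, grid):
    """Round magnitude x to the closest grid value (ties go to the lower value)."""
    return min(grid, key=lambda g: (abs(x - g), g))


def cell_centre_offsets(grid):
    """For each interior grid value, return (centre of its rounding interval) - value.

    A positive offset means the inputs that round to this value are, on average
    under a locally flat density, larger than the value itself, so rounding
    shrinks them.
    """
    mids = [(a + b) / 2 for a, b in zip(grid, grid[1:])]         # decision boundaries
    centres = [(lo + hi) / 2 for lo, hi in zip(mids, mids[1:])]  # interval centres
    return [c - g for c, g in zip(centres, grid[1:-1])]


print(cell_centre_offsets(E2M1_GRID))     # [0.0, 0.0, 0.0, 0.125, 0.0, 0.25]
print(cell_centre_offsets(UNIFORM_GRID))  # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```

The nonzero offsets appear at 2.0 and 4.0, exactly where the E2M1 step size doubles: the interval above each of those values is wider than the one below, so more of the inputs assigned to it are rounded down than up. On the evenly spaced grid every interior value sits at the centre of its interval. The two endpoints of each grid are excluded because their behaviour depends on how values are scaled and clipped.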
The proposed UFP4 recipe demonstrates consistent improvements in convergence and final model quality compared to E2M1 baselines, validated across multiple scales from 1.5B parameter models to 124B mixture-of-experts architectures. The ablation studies isolate which components drive these gains, providing actionable guidance for practitioners.
For the AI infrastructure industry, this work highlights an important gap between current hardware design and optimal training mathematics. It suggests that future accelerator development should reconsider the widespread adoption of E2M1 as a standard, potentially requiring design revisions to support uniform grids as first-class primitives. This could influence roadmaps for major semiconductor vendors and impact cost-efficiency calculations for large-scale model training operations.
- E2M1 FP4 formats suffer from a systematic shrinkage bias that accumulates multiplicatively across layers, causing training instability.
- Uniform grid formats such as E1M2 and INT4 eliminate the geometric asymmetry and deliver superior quantization quality in 4-bit training.
- UFP4 achieves lower loss degradation than E2M1 baselines across model scales from 1.5B to 124B parameters.
- Current hardware implementations that prioritize E2M1 may need architectural reconsideration to support uniform grids as first-class training primitives.
- The research provides a mathematical explanation for training instability previously observed only empirically in FP4 approaches.