The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
Researchers have discovered that FP16 floating-point precision causes systematic numerical divergence between KV-cached and cache-free inference in transformer models, producing 100% token divergence across multiple architectures. This challenges the long-held assumption that KV caching is numerically equivalent to standard computation, with controlled FP32 experiments confirming FP16 non-associativity as the causal mechanism.
The study reveals a fundamental numerical instability in current large language model inference optimization practices. KV caching has been the industry standard for reducing computational overhead in autoregressive decoding, but this research demonstrates that under FP16 precision—widely used in production for memory efficiency—the optimization produces mathematically different token sequences from cache-free computation. The divergence is neither random nor dependent on sampling strategy; even greedy decoding shows 100% token differences, indicating a deterministic arithmetic phenomenon rooted in floating-point non-associativity.
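The core arithmetic fact is easy to reproduce. The following sketch (illustrative values, not taken from the paper) shows that fp16 addition is not associative, and how a one-ulp discrepancy between two groupings of the same sum can deterministically flip a greedy argmax:

```python
import numpy as np

f16 = np.float16

# In fp16 the spacing between representable values near 2048 is 2, so
# 2048 + 1 = 2049 rounds back to 2048 (round-to-nearest-even). The
# grouping of the same three addends therefore changes the result.
a, b, c = f16(2048.0), f16(1.0), f16(1.0)
left  = (a + b) + c   # (2048 + 1) + 1 -> 2048 + 1 -> 2048.0
right = a + (b + c)   # 2048 + (1 + 1) -> 2048 + 2 -> 2050.0
print(left, right)    # 2048.0 2050.0

# Under greedy decoding such a discrepancy becomes a token flip:
# suppose token 0's logit is the sum above and token 1's logit is
# exactly 2050 (hypothetical logits, chosen to make the tie visible).
competitor = f16(2050.0)
print(np.argmax([left,  competitor]))  # 1 -> token 1 is chosen
print(np.argmax([right, competitor]))  # 0 -> tie; argmax takes token 0
```

No randomness is involved: both runs are bitwise reproducible, yet they disagree, which matches the paper's observation that the divergence survives greedy decoding.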
This finding emerges from the mathematical properties of low-precision arithmetic. The accumulation order of FP16 operations differs between the cached and cache-free execution paths, and because floating-point addition is not associative, the two orderings round differently and produce different results. The researchers validated this through FP32 falsification experiments, which reduced divergence by eight orders of magnitude and eliminated token flips entirely, definitively establishing the causal mechanism.
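The cached-versus-uncached contrast can be mimicked with two reduction orders over the same data — a left-to-right running sum (as an incrementally updated state might use) versus a pairwise tree reduction (as a batched kernel might use). This is a minimal analogy, not the paper's actual kernels, with values chosen so fp16 rounding is visible:

```python
import numpy as np

# One large and several small addends, all exactly representable in fp16.
vals = np.array([2048.0] + [1.0] * 7, dtype=np.float16)

def sum_sequential(x):
    """Left-to-right accumulation in the input's own precision."""
    acc = x.dtype.type(0.0)
    for v in x:
        acc = x.dtype.type(acc + v)  # in fp16, 2048 + 1 rounds back to 2048
    return acc

def sum_pairwise(x):
    """Pairwise (tree) reduction, halving the array each pass."""
    while x.size > 1:
        x = x[0::2] + x[1::2]
    return x[0]

print(sum_sequential(vals))  # 2048.0 -- the seven 1s are all absorbed
print(sum_pairwise(vals))    # 2054.0 -- small addends pair up first

# Repeating the comparison in fp32 closes the gap, mirroring the paper's
# FP32 falsification experiments.
vals32 = vals.astype(np.float32)
print(sum_sequential(vals32) == sum_pairwise(vals32))  # True
```

The same mathematical sum yields 2048 or 2054 depending purely on reduction order in fp16, while fp32 has enough precision that both orders agree — the same signature the FP32 control experiments exhibited.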
The practical implications are substantial for LLM deployment. Models currently operating with FP16 KV caching may produce subtly different outputs from those theoretically expected, potentially affecting downstream applications that rely on consistency or reproducibility. The architectural analysis—showing sharp divergence in Grouped-Query Attention models but uniform patterns in alternatives like Gemma—suggests that inference precision requirements vary by model design.
Looking forward, this work calls for a reconsideration of numerical-precision tradeoffs in production systems. Organizations must evaluate whether current deployment practices trade accuracy for speed, and whether higher-precision alternatives or architectural modifications are warranted for critical applications.
- KV caching in FP16 produces 100% systematic token divergence from cache-free computation due to floating-point non-associativity
- The divergence is deterministic and reproducible under greedy decoding, eliminating sampling randomness as a factor
- FP32 validation experiments confirmed FP16 non-associativity as the sole causal mechanism behind the numerical divergence
- Different transformer architectures exhibit predictable divergence patterns: Grouped-Query Attention models show sharp first-layer divergence while Gemma shows uniform patterns
- Cache-ON execution achieved higher accuracy in 8 of 9 test conditions, indicating systematic rather than random numerical differences