Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression
Researchers propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for compressing KV caches in large language models using quaternion mathematics. The technique achieves 5x compression with minimal perplexity loss, matching full-precision performance at ~5 bits while outperforming existing quantization methods across five major model architectures.
HQMQ addresses a critical bottleneck in deploying large language models: the memory overhead of key-value caches during inference. As context windows expand and batch sizes grow, KV cache compression has become essential for practical deployment. This work leverages quaternion algebra—representing 4-element vector chunks as points on a 3-sphere—to create a novel quantization approach that fundamentally differs from traditional integer quantization methods.
The innovation's strength lies in its calibration-free design and mathematical elegance. By composing 24 Hurwitz group elements with random per-layer codebooks, HQMQ generates 24S effective codewords without requiring dataset-specific calibration. This is operationally significant because calibration adds computational overhead and can introduce distribution shifts. The median-multiplier outlier extraction handles modern architectures with extreme activation patterns—a known failure mode for naive int4 quantization.
Practical impact is substantial. On Llama-3-70B with 128k context, HQMQ reduces cache size from 43GB to 8.5GB while maintaining full-precision quality—enabling inference scenarios previously infeasible on consumer hardware. The method achieves 3-1900x quality improvements over naive int quantization at matched bits. Critically, HQMQ at 3.79 bits matches KIVI-4 (a calibrated baseline at ~4.5 bits) on multiple downstream tasks, demonstrating that calibration-free approaches can rival calibrated methods.
For LLM deployment practitioners, this enables longer context windows and larger batch sizes on fixed memory budgets. The lack of calibration requirement simplifies deployment workflows, particularly valuable for edge deployment and dynamic model serving scenarios.
- →HQMQ achieves 5x KV cache compression with calibration-free quantization using quaternion mathematics, reducing Llama-3-70B 128k cache from 43GB to 8.5GB
- →Method matches full-precision performance at ~5 bits across five model architectures while outperforming naive int4 by 3-1900x quality margin
- →Quaternion-based approach eliminates need for dataset calibration while handling outlier-heavy modern architectures through per-batch median-multiplier extraction
- →At 3.79 bits, HQMQ matches calibrated KIVI-4 baseline (4.5 bits) on CoQA, TruthfulQA, and GSM8K within 0.6-2.3 points despite using 16% fewer bits
- →Results span diverse architectures: dense MHA, grouped query attention, and sparse MoE models, indicating broad applicability for LLM inference optimization