STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
Researchers introduce STAR-KV, an adaptive framework that reduces KV cache memory requirements in large language models by up to 75% through low-rank projections and fine-grained rank selection. The technique achieves up to 20x compression when combined with quantization and speeds up attention computation by a reported 6.9x, addressing a critical bottleneck in LLM inference efficiency.
STAR-KV addresses a fundamental challenge in deploying large language models: the memory footprint of key-value (KV) caches during inference. Because LLMs generate tokens sequentially and cache the keys and values of every previous token, the cache grows linearly with sequence length and becomes a primary constraint on batch size and throughput. The research demonstrates that these cached dimensions are highly redundant, and that adaptive low-rank compression can exploit this redundancy without substantial accuracy loss.
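To make the linear growth concrete, the sketch below estimates the cache size for a hypothetical model. The layer count, head count, head dimension, and 16-bit storage are illustrative assumptions, not figures from the STAR-KV work; only the 75% reduction comes from the summary above.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer, head, and token."""
    # The leading factor of 2 accounts for storing both keys and values.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape (assumption): 32 layers, 32 KV heads of width 128,
# 16-bit cache entries, one sequence of 4096 tokens.
full = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                      seq_len=4096, batch_size=1)

# Applying the "up to 75%" reduction quoted in the article.
compressed = full * (1 - 0.75)

print(f"uncompressed: {full / 2**30:.2f} GiB")
print(f"after 75% reduction: {compressed / 2**30:.2f} GiB")
```

Doubling either `seq_len` or `batch_size` doubles the result, which is why the cache, rather than the model weights, often limits how many requests can be served at once.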
The innovation lies in three technical components working in concert. A differentiable thresholding mechanism enables fine-grained rank selection per attention head and block, moving beyond fixed compression ratios that sacrifice performance uniformly. The hybrid decomposition strategy recognizes that key and value projections exhibit different sensitivity characteristics, applying tailored compression strategies rather than uniform treatment. Integration of mixed-precision quantization leverages statistical properties of low-rank components for near-lossless compression at reduced bit depths.
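The first component can be illustrated with a generic sketch of soft thresholding applied to singular values. This is not the authors' implementation: the summary does not give the exact formulation, and in STAR-KV the threshold is described as differentiable (so presumably learned), whereas here `tau` is a fixed, hand-picked value. The function names, matrix shapes, and synthetic spectrum are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(singular_values, tau):
    """Shrink each singular value by tau and clamp at zero."""
    return np.maximum(singular_values - tau, 0.0)


def low_rank_factors(weight, tau):
    """Factor weight (d_model x d_head) into A (d_model x r) and B (r x d_head).

    The rank r is whatever survives the soft threshold, so it adapts to the
    spectrum of each matrix instead of being fixed in advance.
    """
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    s_shrunk = soft_threshold(s, tau)
    # Singular values are sorted in descending order, so survivors come first.
    rank = int(np.count_nonzero(s_shrunk))
    a = u[:, :rank] * s_shrunk[:rank]   # down-projection, scaled by the spectrum
    b = vt[:rank, :]                    # up-projection used at attention time
    return a, b, rank


# Synthetic projection matrix (assumption) with 4 strong directions out of 16.
rng = np.random.default_rng(0)
q1, _ = np.linalg.qr(rng.standard_normal((64, 16)))
q2, _ = np.linalg.qr(rng.standard_normal((16, 16)))
spectrum = np.array([10.0, 8.0, 6.0, 4.0] + [0.5] * 12)
weight = q1 @ np.diag(spectrum) @ q2.T

a, b, rank = low_rank_factors(weight, tau=1.0)

# Cache the low-rank activations instead of the full keys.
hidden = rng.standard_normal((5, 64))    # 5 tokens, hypothetical d_model = 64
cache = hidden @ a                       # shape (5, rank) instead of (5, 16)
approx_keys = cache @ b                  # reconstructed on demand

print("selected rank:", rank)
print("cached floats per token:", cache.shape[1], "instead of", weight.shape[1])
```

Two design points are worth noting. Soft thresholding also shrinks the singular values it keeps, so the reconstruction is slightly biased; a learned threshold trained against a task loss can trade that bias against rank. Running the function per head, or per block, gives each one its own rank, which is the fine-grained control the paragraph above describes.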
For the AI infrastructure sector, this advancement has immediate practical implications. The reported 6.9x speedup in attention modules and 3.1x improvement in end-to-end throughput translate directly into lower inference costs and higher service capacity. These gains are particularly valuable for real-time applications and cost-sensitive deployments where inference expenses dominate operational budgets. The public availability of the code should accelerate adoption across research and production environments.
Longer-term, optimizations like STAR-KV reduce the hardware requirements for serving LLMs, potentially democratizing access to frontier models. This efficiency trend counteracts the scaling pressures from increasingly large models, creating a dynamic where inference becomes progressively more accessible despite growing model complexity. Future work may explore runtime adaptivity and integration with emerging hardware accelerators.
- STAR-KV achieves up to 75% KV cache compression through adaptive low-rank projection with fine-grained rank control at attention-head and block levels
- Combined with quantization, the method delivers up to 20x total compression while maintaining model accuracy across multiple LLM architectures
- Custom GPU kernels enable 6.9x attention module speedup and 3.1x end-to-end generation throughput improvement in production deployments
- Differentiable thresholding mechanism enables optimal rank selection automatically rather than relying on fixed or heuristic approaches
- Public code release accelerates adoption in research and production environments addressing critical KV cache bottlenecks in LLM inference
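
The headline compression figures can be related with a little arithmetic. The sketch below is a back-of-the-envelope reading, not a result from the work itself: it assumes a 16-bit baseline cache, that the low-rank and quantization ratios simply multiply, and that the two "up to" figures are reached in the same configuration.

```python
def compression_ratio(fraction_removed):
    """Convert a fractional size reduction into an 'N x' compression ratio."""
    return 1.0 / (1.0 - fraction_removed)


BASELINE_BITS = 16            # assumption: uncompressed cache stored in 16-bit floats
TOTAL_RATIO = 20.0            # "up to 20x" with quantization, from the summary

low_rank_ratio = compression_ratio(0.75)        # a 75% reduction is a 4x ratio
quant_ratio = TOTAL_RATIO / low_rank_ratio      # extra factor left for quantization
avg_bits = BASELINE_BITS / quant_ratio          # implied average bits per cached value

print(f"low-rank ratio: {low_rank_ratio:.1f}x")
print(f"quantization ratio needed: {quant_ratio:.1f}x")
print(f"implied average precision: {avg_bits:.1f} bits")
```

Under these assumptions the quantization stage would need to contribute roughly a further 5x, which corresponds to a little over 3 bits per value on average and is consistent with a mixed-precision scheme rather than a single uniform bit width.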