RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
Researchers propose RDKV, a compression technique that jointly optimizes eviction and quantization of the Key-Value cache in large language models to reduce memory bottlenecks during inference. The method achieves a 4.5x decode speedup and a 1.9x peak memory reduction at 128K context length while maintaining 97.81% accuracy, addressing a critical performance constraint in LLM deployment.
The Key-Value cache represents a fundamental performance bottleneck in LLM inference, particularly as context lengths expand. During decoding, the KV cache must be repeatedly streamed from high-bandwidth memory to on-chip memory, a memory-bound operation that limits throughput regardless of compute capacity. RDKV reframes this engineering challenge as a unified rate-distortion optimization problem, treating eviction and quantization as points on a single continuous bit-width spectrum (eviction being the zero-bit limit) rather than isolated techniques. This perspective shift enables finer-grained trade-offs between precision and cache size.
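A back-of-envelope calculation shows the scale of the problem. The sketch below assumes a hypothetical Llama-7B-like configuration (32 layers, 32 heads, head dimension 128, fp16 storage) and roughly 2 TB/s of HBM bandwidth; none of these numbers come from the RDKV paper, but the arithmetic illustrates why decode throughput is bandwidth-bound.

```python
# Back-of-envelope KV cache traffic per decode step. All configuration numbers
# are assumptions (a hypothetical Llama-7B-like model), not values from the
# RDKV paper; they only illustrate why decoding is memory-bound.
layers, heads, head_dim, bytes_fp16 = 32, 32, 128, 2
context_len = 128 * 1024  # 128K tokens

# Each cached token stores one key and one value vector per layer.
kv_bytes_per_token = 2 * layers * heads * head_dim * bytes_fp16  # 512 KiB
cache_bytes = context_len * kv_bytes_per_token                   # 64 GiB

print(f"KV cache per sequence: {cache_bytes / 2**30:.0f} GiB")
# Every decode step must stream the whole cache from HBM, so at an assumed
# ~2 TB/s of bandwidth the transfer alone caps decoding near 30 tokens/s.
print(f"Bandwidth-bound ceiling: {2e12 / cache_bytes:.0f} tokens/s")
```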
The approach measures the distortion each token and channel induces in the attention computation to assign importance weights, then applies reverse water-filling (an information-theoretic allocation principle) to determine optimal bit-widths. This principled methodology contrasts with heuristic compression schemes that lack theoretical grounding. The experimental results demonstrate substantial practical gains: recovering near-full accuracy (97.81%) on LongBench while retaining only 2.48% of the cache amounts to a roughly 40x compression ratio.
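To make the reverse water-filling step concrete, here is a minimal sketch that treats the per-entry importance weights as variances of parallel Gaussian sources, the classical setting for reverse water-filling, and bisects on the water level until a total bit budget is met. The function name, the bisection scheme, and the integer rounding are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def reverse_waterfill_bits(importance, bit_budget, max_bits=8, tol=1e-9):
    """Allocate bit-widths by reverse water-filling over importance weights.

    Each weight is treated as the variance of a parallel Gaussian source
    (an assumption for illustration): entries whose variance falls below
    the water level theta get zero rate, the rest get
    0.5 * log2(variance / theta) bits. Bisect on theta until the
    real-valued total rate meets the budget, then round to integers.
    """
    var = np.asarray(importance, dtype=np.float64)

    def total_rate(theta):
        return np.maximum(0.0, 0.5 * np.log2(var / theta)).sum()

    lo, hi = 1e-12, float(var.max())  # at theta = var.max() the rate is zero
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total_rate(mid) > bit_budget:
            lo = mid  # rate too high: raise the water level
        else:
            hi = mid
    bits = np.maximum(0.0, 0.5 * np.log2(var / hi))
    # Zero-bit entries are dropped outright; the rest are quantized to the
    # allocated precision. Eviction is just the 0-bit end of the spectrum.
    return np.clip(np.rint(bits), 0, max_bits).astype(int)

# Toy usage: eight cache entries, a 12-bit total budget.
weights = np.array([4.0, 2.0, 1.0, 0.5, 0.25, 0.1, 0.05, 0.01])
print(reverse_waterfill_bits(weights, bit_budget=12.0))  # -> [3 3 2 2 1 1 0 0]
```

Note how eviction falls out naturally in this formulation: entries whose importance sits below the water level receive zero bits, which is exactly the zero-bit end of the spectrum described above.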
For the AI infrastructure sector, RDKV addresses production deployment constraints that currently limit context window utilization and throughput. Organizations running long-context applications face hardware costs tied directly to memory requirements; techniques delivering 1.9x memory reduction without proportional accuracy loss directly impact operational economics. The 4.5x decode speedup translates to improved latency for real-time applications and better resource utilization in batch processing environments.
Longer-term, this research direction signals a shift toward adaptive, theoretically principled compression schemes rather than fixed quantization strategies. As context lengths continue to expand, systematic approaches to joint optimization become increasingly valuable.
- RDKV unifies KV cache eviction and quantization under rate-distortion theory, enabling joint optimization rather than isolated compression techniques.
- The method achieves 4.5x decode speedup and 1.9x peak memory reduction on 128K contexts while maintaining 97.81% accuracy on LongBench.
- A reverse water-filling algorithm assigns per-token and per-channel bit-widths based on attention distortion, creating adaptive precision allocation.
- Results across LongBench, RULER, and InfiniteBench show a 9.1% average improvement over existing baselines.
- The technique directly reduces operational costs and hardware requirements for long-context LLM deployment in production environments.