AIBullish · arXiv – CS AI · 10h ago · 7/10
🧠
RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
Researchers propose RDKV, a compression method that jointly optimizes eviction and quantization of the key-value (KV) cache in large language models, targeting the memory bottleneck of long-context inference. At 128K-token context lengths the method achieves a 4.5x decode speedup and a 1.9x reduction in peak memory while maintaining 97.81% accuracy, easing a critical constraint on LLM deployment.
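The summary doesn't spell out how the bit allocation works, but the title suggests a rate-distortion formulation in which each cached token is assigned a bit width, with 0 bits amounting to eviction. Below is a minimal sketch of that idea, assuming a greedy marginal-gain allocator and the standard high-rate distortion model D_i(b) = s_i · 2^(-2b); the function name `allocate_bits`, the per-token importance scores, and the bit-width menu are all illustrative assumptions, not details from the paper.

```python
import numpy as np

def allocate_bits(importance, total_budget, bit_choices=(0, 2, 4, 8)):
    """Greedy rate-distortion bit allocation across KV-cache tokens (sketch).

    importance   : per-token sensitivity scores, shape (n_tokens,);
                   higher means more distortion if compressed aggressively.
                   How RDKV actually scores tokens is not given in the post.
    total_budget : total bits available for the whole cache. A real KV cache
                   would count bits per element across heads and dimensions;
                   per-token bits keep the sketch simple.
    bit_choices  : allowed bit widths; 0 bits means the token is evicted.

    Assumes the classic high-rate model D_i(b) = importance[i] * 2**(-2*b).
    Returns one bit width per token.
    """
    n = len(importance)
    levels = sorted(bit_choices)
    alloc = np.zeros(n, dtype=int)   # start with everything evicted (0 bits)
    spent = 0

    def distortion(i, b):
        return importance[i] * 2.0 ** (-2 * b)

    # Repeatedly grant the single upgrade (token -> next wider bit width)
    # with the best distortion reduction per extra bit, until the budget
    # is exhausted or no affordable upgrade remains.
    while True:
        best_gain, best = 0.0, None
        for i in range(n):
            idx = levels.index(alloc[i])
            if idx + 1 >= len(levels):
                continue                      # already at the widest width
            nxt = levels[idx + 1]
            extra = nxt - alloc[i]
            if spent + extra > total_budget:
                continue                      # upgrade doesn't fit the budget
            gain = (distortion(i, alloc[i]) - distortion(i, nxt)) / extra
            if gain > best_gain:
                best_gain, best = gain, (i, nxt, extra)
        if best is None:
            break
        i, nxt, extra = best
        alloc[i] = nxt
        spent += extra
    return alloc
```

For example, `allocate_bits(np.array([5.0, 1.0, 0.1]), total_budget=4)` returns `[2, 2, 0]`: the two sensitive tokens get 2-bit quantizers and the least important one is evicted, which is the joint eviction-plus-quantization behavior the title describes.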