Recency/Frequency Adaptive KV Caching for Large Language Model Serving
Researchers propose an adaptive key-value caching strategy for large language models that dynamically allocates cache space based on recency and frequency patterns, improving upon traditional LRU eviction policies. The approach demonstrates up to 10.8% improvement in cache hit rates and 12.6% reduction in time-to-first-token on synthetic workloads, with more modest gains on real-world conversation data.
This research addresses a fundamental efficiency challenge in LLM inference infrastructure. Key-value caching accelerates generation by storing the attention keys and values already computed for earlier tokens, but managing limited cache space across diverse workloads remains difficult. Traditional least-recently-used (LRU) eviction policies cause cache thrashing when multiple unrelated requests compete for space, forcing systems to repeatedly recompute the same values. The proposed adaptive strategy addresses this by weighting eviction decisions on both temporal recency and access frequency, so that frequently accessed blocks stay cached even while other requests compete for the same space.
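To make the contrast in the paragraph above concrete, here is a minimal sketch of the two eviction styles. The summary does not give the paper's scoring function, so the linear blend of recency and frequency below, the `freq_weight` parameter, the class names, and the toy trace are all assumptions made for illustration, not the authors' method or their measured workloads.

```python
from collections import OrderedDict


class LRUCache:
    """Baseline: evict the least recently used block."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # oldest entry first
        self.hits = 0

    def access(self, block_id):
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)  # mark as most recent
            self.hits += 1
            return True
        if len(self.blocks) >= self.capacity:
            self.blocks.popitem(last=False)  # drop the oldest block
        self.blocks[block_id] = None
        return False


class RecencyFrequencyCache:
    """Hypothetical policy: evict the block with the lowest blended score."""

    def __init__(self, capacity, freq_weight=0.5):
        self.capacity = capacity
        self.freq_weight = freq_weight  # assumed knob: 0 = recency only, 1 = frequency only
        self.clock = 0                  # logical time, one tick per access
        self.last_used = {}             # block_id -> tick of last access
        self.count = {}                 # block_id -> number of accesses while cached
        self.hits = 0

    def _score(self, block_id):
        age = self.clock - self.last_used[block_id]
        recency = 1.0 / (1 + age)                                   # newer is closer to 1
        frequency = self.count[block_id] / max(self.count.values())  # normalised to (0, 1]
        return (1 - self.freq_weight) * recency + self.freq_weight * frequency

    def access(self, block_id):
        self.clock += 1
        if block_id in self.last_used:
            self.last_used[block_id] = self.clock
            self.count[block_id] += 1
            self.hits += 1
            return True
        if len(self.last_used) >= self.capacity:
            victim = min(self.last_used, key=self._score)  # lowest score is evicted
            del self.last_used[victim]
            del self.count[victim]
        self.last_used[block_id] = self.clock
        self.count[block_id] = 1
        return False


def replay(cache, trace):
    """Feed every block id in the trace to the cache and return the hit rate."""
    for block_id in trace:
        cache.access(block_id)
    return cache.hits / len(trace)


# Toy trace (an assumption): two requests reuse a shared prefix block "sys",
# then one long document touches three blocks that are never seen again.
trace = []
for i in range(10):
    trace += ["sys", "sys", f"doc{i}-a", f"doc{i}-b", f"doc{i}-c"]

lru = LRUCache(capacity=3)
rf = RecencyFrequencyCache(capacity=3, freq_weight=0.5)
print("LRU hit rate:", replay(lru, trace))
print("recency/frequency hit rate:", replay(rf, trace))
```

On this toy trace the LRU cache evicts the shared `sys` block every time a document's three blocks pass through, while the scored cache keeps it because its access count outweighs its age. The size of the gap here is an artifact of the hand-built trace and says nothing about the gains reported in the research.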
The work builds on growing recognition that LLM serving efficiency directly affects deployment costs and user experience. As inference becomes increasingly cost-sensitive for production systems, even modest improvements in cache efficiency compound across billions of daily requests. The 10.8% cache hit rate improvement on document QA points to meaningful headroom over LRU, though the 2.1% gain on real-world conversation data suggests that effectiveness depends heavily on the workload.
For infrastructure providers and LLM deployment platforms, this technique offers a practical optimization path requiring no architectural changes. The approach generalizes to batch inference and maintains interpretability, reducing implementation friction. The interpretability aspect particularly matters for production systems where unexplainable performance variations create operational risk.
Future development should focus on adaptive parameter tuning across diverse production environments. The gap between synthetic and real-world performance gains indicates room for refinement in frequency weighting algorithms. Integration with emerging speculative decoding methods and multi-model serving scenarios remains an open question.
- Adaptive recency-frequency caching improves KV cache hit rates by up to 10.8% over standard LRU policies on synthetic workloads.
- Real-world conversation data shows more modest 2.1% performance gains, indicating workload-dependent effectiveness.
- The method reduces time-to-first-token by up to 12.6% without requiring architectural changes to existing systems.
- Adaptive caching prevents cache thrashing caused by unrelated workloads competing for limited space.
- The technique generalizes well to batch inference and maintains operational interpretability for production deployment.