#kv-cache-optimization News & Analysis

13 articles tagged with #kv-cache-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

RedKnot is a KV cache management system for large language models that improves memory efficiency by managing the cache per attention head rather than as one uniform block. This head-aware approach enables better resource utilization, higher serving concurrency, and improved scalability without requiring model retraining.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Researchers propose Cross-Layer Sparse Attention (CLSA), a novel architecture that optimizes long-context LLM inference by sharing both key-value caches and routing indices across decoder layers. The method achieves up to 7.6x decoding speedup and 17.1x throughput improvement at 128K context while maintaining accuracy, addressing the efficiency-quality tradeoff that has constrained existing sparse attention approaches.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

BudgetDraft is a new training method for sparse-KV speculative decoding that enables faster language model inference under memory constraints. By training drafters to handle multiple KV cache budgets simultaneously, the technique achieves up to 6.55x speedup on mid-to-long context inference while maintaining acceptance rates and reducing GPU memory usage.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

Researchers propose LU-KV, a novel framework for optimizing KV cache eviction in large language models by formulating budget allocation as a combinatorial optimization problem. The approach reduces KV cache size by 80% while maintaining performance, significantly lowering inference latency and GPU memory requirements.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

Researchers introduce SPEED, an inference optimization for long-context language models that cuts compute by materializing key-value cache states only in the lower layers during prefill while keeping full-depth processing during decoding. On Llama-3.1-8B it shows a 33% improvement in time-to-first-token, a 22% improvement in tokens-per-second, and a 25% reduction in KV memory with minimal quality degradation, suggesting that prompt tokens do not need persistent full-depth caching.

🧠 Llama
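
For a sense of scale, the 25% KV memory figure above can be turned into rough numbers. This is a back-of-the-envelope sketch, not the paper's accounting: the Llama-3.1-8B shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is taken from the public model config rather than from the summary, the 128K-token context is an assumed example, and applying the 25% saving uniformly is a simplification.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 counts one K tensor and one V tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed model shape (public Llama-3.1-8B config, not stated in the summary).
LLAMA_31_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **LLAMA_31_8B)
full = kv_cache_bytes(seq_len=128 * 1024, **LLAMA_31_8B)  # assumed 128K context
reduced = full * (1 - 0.25)  # the reported 25% saving, applied uniformly

print(per_token // 1024, "KiB per token")
print(full / 2**30, "GiB at 128K tokens")
print(reduced / 2**30, "GiB after a 25% reduction")
```

Under these assumptions a single 128K-token sequence needs 16 GiB of KV cache, so a 25% saving frees about 4 GiB per sequence.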

AI · Bullish · arXiv – CS AI · May 7 · 7/10

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

RetentiveKV introduces an entropy-driven optimization method for multimodal large language models that achieves 5x KV cache compression and 1.5x decoding acceleration by reformulating token eviction as continuous memory evolution rather than discrete pruning. The approach addresses limitations of existing compression methods by accounting for visual tokens that gain importance later in decoding and preserving spatial continuity of visual information.

AI · Bullish · arXiv – CS AI · May 4 · 7/10

Make Your LVLM KV Cache More Lightweight

Researchers propose LightKV, a technique that reduces Key-Value cache memory overhead in Large Vision-Language Models by compressing vision tokens using cross-modality message passing guided by text prompts. The method achieves a 50% reduction in KV cache size while using only 55% of the original vision tokens and reducing computation by up to 40%, maintaining performance across eight benchmark datasets.

AI · Bullish · arXiv – CS AI · May 1 · 7/10

Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

Researchers present a unified system for KV cache memory management in large-scale GPU inference. It addresses three inefficiencies through architecture-aware sizing, a multi-tier memory hierarchy spanning CPU to NVMe storage, and predictive eviction policies. The approach achieves 70-84% cache hit rates and projects 1.4-2.1x improvements in latency and 1.7-2.9x throughput gains while reducing costs by 47% compared to existing solutions.
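
The tiering idea can be illustrated with a toy two-tier cache. This is a minimal sketch under loud assumptions: it uses plain LRU eviction in place of the paper's predictive policy, models only two tiers, and the class name `TwoTierKVCache` and its methods are hypothetical, not taken from the paper.

```python
from collections import OrderedDict


class TwoTierKVCache:
    """Toy fast/slow cache: LRU entries are demoted from fast to slow, then dropped."""

    def __init__(self, fast_capacity, slow_capacity):
        self.fast = OrderedDict()  # stands in for a fast tier such as GPU memory
        self.slow = OrderedDict()  # stands in for a slower tier such as CPU or NVMe
        self.fast_capacity = fast_capacity
        self.slow_capacity = slow_capacity
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.fast:
            self.fast.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.fast[key]
        if key in self.slow:
            value = self.slow.pop(key)  # promote back into the fast tier
            self.hits += 1
            self._put_fast(key, value)
            return value
        self.misses += 1
        return None

    def put(self, key, value):
        self.slow.pop(key, None)  # avoid holding a stale copy in the slow tier
        self._put_fast(key, value)

    def _put_fast(self, key, value):
        self.fast[key] = value
        self.fast.move_to_end(key)
        while len(self.fast) > self.fast_capacity:
            old_key, old_value = self.fast.popitem(last=False)  # demote the LRU entry
            self.slow[old_key] = old_value
            while len(self.slow) > self.slow_capacity:
                self.slow.popitem(last=False)  # evict entirely

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = TwoTierKVCache(fast_capacity=1, slow_capacity=1)
for name in ("a", "b", "c"):
    cache.put(name, "kv-" + name)
found_b = cache.get("b")  # served from the slow tier, then promoted
found_a = cache.get("a")  # already evicted from both tiers
print(found_b, found_a, cache.hit_rate())
```

A hit rate measured this way counts a hit in either tier; a real system would also weigh the cost of promoting data from the slower tier.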

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

MEMENTO: Teaching LLMs to Manage Their Own Context

Researchers introduce MEMENTO, a method enabling large language models to compress their reasoning into dense summaries (mementos) organized into blocks, reducing KV cache usage by 2.5x and improving throughput by 1.75x while maintaining accuracy. The technique is validated across multiple model families using OpenMementos, a new dataset of 228K annotated reasoning traces.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models

ReasonAlloc is a training-free framework that optimizes key-value cache memory allocation during LLM inference for reasoning tasks by using hierarchical, non-uniform budget distribution across layers and attention heads. The method significantly reduces memory bottlenecks in chain-of-thought reasoning while maintaining performance, outperforming existing compression approaches on mathematical reasoning benchmarks.

🧠 Llama
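
Non-uniform budget distribution can be pictured with a small sketch. The summary does not describe ReasonAlloc's hierarchical rule, so this swaps in plain proportional allocation with largest-remainder rounding; the function name `allocate_budget` and the importance scores are hypothetical.

```python
def allocate_budget(total_budget, importance):
    """Split an integer token budget across heads in proportion to importance."""
    total_weight = sum(importance)
    if total_weight <= 0:
        raise ValueError("importance scores must sum to a positive value")
    exact = [total_budget * w / total_weight for w in importance]
    budgets = [int(x) for x in exact]  # round every share down first
    leftover = total_budget - sum(budgets)
    # Hand the remaining tokens to the heads with the largest fractional parts.
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - budgets[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        budgets[i] += 1
    return budgets


# Hypothetical importance scores for three attention heads.
budgets = allocate_budget(10, [4, 3, 1])
print(budgets)
```

The rounding step matters because the per-head budgets must be whole tokens and still sum exactly to the total.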

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference

Researchers introduce RKSC, a training-free inference framework that optimizes multi-step LLM reasoning by sharing KV cache across similar branches and implementing early exit mechanisms. The system achieves 3x average speedup over baseline methods with minimal error rates, advancing efficiency in large language model inference without requiring model retraining.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Researchers introduce VideoMLA, a novel approach that reduces KV cache memory requirements in video diffusion models by 92.7% through Multi-Head Latent Attention, enabling longer video generation with improved efficiency. The method challenges conventional assumptions about low-rank approximations in video models and demonstrates comparable quality to existing methods while improving throughput by 23%.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

Researchers present KV-RM, a runtime optimization that manages KV-cache memory movement in static-graph LLM decoders, achieving better throughput and reduced latency variability without sacrificing the predictability benefits of static graph execution. The approach decouples logical KV histories from physical storage through a block pager and merge-staged transport mechanism, demonstrating practical improvements on multi-GPU systems.

🏢 Nvidia
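
Decoupling logical KV histories from physical storage can be sketched with a block table. This is a generic paged-KV illustration, not KV-RM's block pager or its merge-staged transport; the class `BlockPager` and its methods are hypothetical names.

```python
class BlockPager:
    """Maps each sequence's logical token positions onto fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # every allocated block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def locate(self, seq_id, position):
        """Translate a logical position into (physical_block, offset)."""
        if not 0 <= position < self.lengths.get(seq_id, 0):
            raise IndexError("position outside the stored history")
        table = self.block_tables[seq_id]
        return table[position // self.block_size], position % self.block_size

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


pager = BlockPager(num_blocks=4, block_size=2)
slots = [pager.append_token("a") for _ in range(3)]
third = pager.locate("a", 2)
pager.release("a")
print(slots, third, len(pager.free_blocks))
```

Because a sequence only ever refers to block ids, its history need not be contiguous in memory, and blocks can be reused as soon as a sequence finishes.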