#kv-cache-compression News & Analysis

17 articles tagged with #kv-cache-compression. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

Researchers introduce EntropyInfer, a training-free framework that speeds up long-context LLM inference by allocating compute dynamically according to attention entropy patterns. The method achieves up to 2.39× speedup on models such as Llama and Qwen at context lengths beyond 100k tokens while maintaining output quality, addressing limitations in existing sparse attention and KV cache compression techniques.
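
As a rough illustration of the signal involved, the sketch below computes the Shannon entropy of a single attention distribution and maps it to a key budget. The `budget_from_entropy` rule and its `min_keep` parameter are hypothetical, chosen only for illustration; this is not EntropyInfer's actual allocation policy.

```python
import math

def attention_entropy(weights):
    """Shannon entropy (in nats) of one attention distribution."""
    total = sum(weights)
    probs = [w / total for w in weights if w > 0]
    return -sum(p * math.log(p) for p in probs)

def budget_from_entropy(weights, min_keep=4):
    """Hypothetical rule: keep roughly exp(entropy) keys, i.e. the
    effective number of keys the head attends to, clamped to
    [min_keep, len(weights)]."""
    effective = math.exp(attention_entropy(weights))
    return max(min_keep, min(len(weights), math.ceil(effective)))

# A sharply peaked head needs few keys; a flat head needs all of them.
peaked = [0.97, 0.01, 0.01, 0.01] + [0.0] * 12
flat = [1 / 16] * 16

peaked_budget = budget_from_entropy(peaked)
flat_budget = budget_from_entropy(flat)
```

Under these assumptions the peaked head falls back to the `min_keep` floor while the flat head keeps all 16 keys, which is the kind of per-head variation a rigid, uniform budget cannot express.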

🧠 Llama

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

Researchers introduce STAR-KV, an adaptive compression framework that reduces KV cache memory requirements in large language models by up to 75% through low-rank projections and intelligent rank selection. The technique achieves up to 20x compression when combined with quantization and delivers significant speedups in attention computation, addressing a critical bottleneck in LLM inference efficiency.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Researchers introduce FlashMemory-DeepSeek-V4, a novel inference system using Lookahead Sparse Attention to reduce GPU memory requirements for long-context LLM serving by 86.5% while maintaining accuracy. The approach uses a neural memory indexer to selectively preserve only critical KV cache chunks, enabling efficient processing of ultra-long contexts up to 500K tokens.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

End-to-End Context Compression at Scale

Researchers introduce Latent Context Language Models (LCLMs), a new encoder-decoder compression approach that addresses memory bottlenecks in long-context language model inference. By compressing KV caches at ratios of 1:4 to 1:16 while maintaining model quality, LCLMs enable faster processing of extended contexts and support adaptive expansion for long-horizon agent applications.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Researchers introduce Moment-KV, a momentum-based technique that compresses the key-value (KV) cache during LLM decoding. It tracks token importance dynamically through temporal attention patterns rather than static heuristics, improving long-generation task performance by 2.3-3.2% while keeping latency unchanged.
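
To make "momentum-based" importance tracking concrete, here is a minimal sketch that smooths per-token attention with an exponential moving average and keeps the highest-scoring tokens. The function names, the `beta` value, and the toy attention rows are assumptions for illustration; the paper's actual update rule may differ.

```python
def update_importance(importance, attention, beta=0.9):
    """Blend the running importance score with the newest attention
    weights (exponential moving average, a common momentum form)."""
    return [beta * s + (1 - beta) * a for s, a in zip(importance, attention)]

def keep_top(importance, budget):
    """Indices of the `budget` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:budget])

# Toy decode trace: attention over 4 cached tokens at 3 decode steps.
steps = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.7, 0.1],
]
scores = [0.0, 0.0, 0.0, 0.0]
for attn in steps:
    scores = update_importance(scores, attn, beta=0.5)

kept = keep_top(scores, budget=2)
```

Here token 0 was important early and token 2 is important now, so the smoothed scores retain both, whereas a heuristic based only on the latest step would rank token 0 no higher than tokens 1 and 3.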

AI · Bullish · arXiv – CS AI · May 28 · 7/10

CIVIC: End-to-End Sequence Compactness for Efficient Vision-Language Models

Researchers introduce CIVIC, a framework that optimizes Vision-Language Models by maintaining compact visual token sequences throughout the entire inference pipeline, reducing KV-cache memory to one-third while achieving measurable hardware acceleration without accuracy loss.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Hurwitz Quaternion Multiplicative Quantization for KV Cache Compression

Researchers propose Hurwitz Quaternion Multiplicative Quantization (HQMQ), a calibration-free method for compressing KV caches in large language models using quaternion mathematics. The technique achieves 5x compression with minimal perplexity loss, matching full-precision performance at ~5 bits while outperforming existing quantization methods across five major model architectures.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

Researchers have developed a bias correction technique for quantizing KV-cache memory in video diffusion models, addressing a fundamental problem where quantization noise causes inflated attention to cached data. The method recovers near-full quality video generation while using 50% less memory than standard approaches, enabling longer video synthesis without sacrificing output quality.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

Researchers propose RDKV, a compression technique that jointly optimizes eviction and quantization of the key-value (KV) cache in large language models to reduce memory bottlenecks during inference. At a 128K context length, the method achieves a 4.5x decode speedup and a 1.9x reduction in peak memory while maintaining 97.81% accuracy, addressing a critical performance constraint in LLM deployment.

AI · Neutral · arXiv – CS AI · May 11 · 7/10

Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression

Researchers introduce KVFundaBench to expose a critical gap in KV cache compression evaluation: while retrieval tasks remain robust under compression, reasoning tasks degrade severely due to disrupted Chain-of-Thought coherence. They propose ShotKV, which preserves semantic integrity by treating few-shot examples as indivisible units, achieving 9-18% accuracy improvements on long-context tasks while reducing latency by 11%.
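
The idea of treating few-shot examples as indivisible units can be sketched as eviction at shot granularity instead of token granularity. The scoring rule (mean token score per shot), the greedy selection, and all names below are assumptions for illustration, not ShotKV's published procedure.

```python
def keep_whole_shots(shots, token_budget):
    """shots: list of (shot_id, [per-token scores]).
    Greedily keep the highest mean-score shots in full; a shot that
    does not fit in the remaining budget is dropped entirely rather
    than truncated."""
    ranked = sorted(shots, key=lambda s: sum(s[1]) / len(s[1]), reverse=True)
    kept, used = [], 0
    for shot_id, scores in ranked:
        if used + len(scores) <= token_budget:
            kept.append(shot_id)
            used += len(scores)
    return sorted(kept)

# Hypothetical per-token importance scores for three few-shot examples.
shots = [
    ("shot0", [0.2, 0.1, 0.3]),
    ("shot1", [0.9, 0.8]),
    ("shot2", [0.5, 0.6, 0.4, 0.5]),
]
kept_shots = keep_whole_shots(shots, token_budget=6)
```

With a six-token budget this keeps `shot1` and `shot2` intact and drops `shot0` as a whole, so no retained example has its reasoning chain cut mid-way.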

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

OjaKV: Context-Aware Online Low-Rank KV Cache Compression

OjaKV introduces a novel framework for compressing key-value caches in large language models through online low-rank projection, addressing a critical memory bottleneck in long-context inference. The method combines selective full-rank storage for important tokens with adaptive compression for intermediate tokens, maintaining accuracy while reducing memory consumption without requiring model fine-tuning.

🧠 Llama

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Quantization Dominates Rank Reduction for KV-Cache Compression

A new study demonstrates that quantization significantly outperforms rank reduction for compressing KV caches in transformer inference, improving perplexity (PPL) by 4 to 364 across multiple models. The research shows that preserving all dimensions while reducing precision is structurally superior to discarding dimensions, with INT4 quantization matching FP16 accuracy while cutting total KV cache size by 75%.
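
The 75% figure is consistent with simple bit-width arithmetic: storing each value in 4 bits instead of 16 leaves one quarter of the original size. The sketch below shows that calculation alongside a generic symmetric 4-bit quantizer; it is a textbook scheme used for illustration, not the study's method, and it ignores the small overhead of storing one scale per vector.

```python
def quantize_int4(values):
    """Symmetric per-vector quantization to signed 4-bit codes in [-7, 7]."""
    peak = max(abs(v) for v in values)
    scale = peak / 7 if peak > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map 4-bit codes back to approximate floating-point values."""
    return [c * scale for c in codes]

def kv_reduction(bits_before=16, bits_after=4):
    """Fraction of KV memory saved, ignoring per-vector scale overhead."""
    return 1 - bits_after / bits_before

keys = [0.7, -0.3, 0.1, 0.0]
codes, scale = quantize_int4(keys)
restored = dequantize(codes, scale)
saved = kv_reduction()
```

Every dimension of `keys` survives in `restored`, only at lower precision, which is the structural contrast with rank reduction that the summary describes.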

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

Researchers introduce UltraQuant, a 4-bit key-value cache compression technique optimized for long-context AI agents that need to process multiple conversation turns efficiently. The method achieves 3.47x faster response times in cache-pressured scenarios and 1.63x higher throughput compared to standard FP8 approaches, with practical optimizations for AMD GPU deployment.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching

Researchers propose Semantic Cache Distillation (SCD), a technical framework that significantly reduces communication overhead in large language model inference by replacing raw Key-Value cache transmission with compact semantic codes. The method achieves up to 2.65x speedup in time-to-first-token while maintaining generation quality within 5% of baseline performance, addressing a critical bottleneck in disaggregated LLM serving architectures.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

Researchers develop theoretical bounds for KV cache compression in language models, discovering that context sensitivity decays polynomially rather than exponentially. Their findings enable more efficient memory-aware cache policies that reduce memory requirements while maintaining model performance, with practical implications for deploying larger models on resource-constrained systems.

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models

Researchers introduce STaR-KV, a training-free compression framework that reduces key-value cache memory consumption in vision-language GUI agents by up to 40% while maintaining accuracy. The method addresses a critical bottleneck where models like UI-TARS-1.5-7B consume prohibitive GPU memory during multi-step interactions, enabling more practical deployment on standard accelerators.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

Researchers present a diagnostic framework for evaluating KV cache eviction selectors in large language models, identifying three failure modes and demonstrating that value-aware ranking combined with evidence recovery achieves 72.6% accuracy on positive-margin test cases. The work addresses a critical bottleneck in long-context LLM inference by revealing why compression strategies succeed or fail.