#kv-cache News & Analysis

41 articles tagged with #kv-cache. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

41 articles

AIBullisharXiv – CS AI · Jun 237/10

🧠

Keyless Attention: Value-Space Routing and Value-Only Caching for Efficient Transformers

Researchers propose Keyless Attention, a transformer mechanism that eliminates key projections to reduce KV cache memory by 50% while maintaining or improving performance across multiple model architectures. The approach introduces a value-space routing matrix that replaces the traditional key projection, demonstrating competitive results on perplexity and downstream benchmarks.

🏢 Perplexity🧠 Llama

AIBullisharXiv – CS AI · Jun 237/10

🧠

Kamera: Unified Position-Invariant Multimodal KV Cache for Training-Free Reuse

Researchers introduce Kamera, a training-free method that enables efficient reuse of cached key-value pairs in multimodal AI models regardless of position in the context window. By storing small low-rank conditioning patches alongside position-free chunks, the system maintains accuracy for complex multi-hop reasoning tasks while reducing computational overhead—particularly benefiting video and vision-heavy applications.

AIBullisharXiv – CS AI · Jun 117/10

🧠

Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning

Researchers introduce TASM (Task-Aware Structured Memory), a training-free framework that optimizes how multi-modal large language models compress and retrieve information during in-context learning. The method addresses critical scalability limitations by using task-aware compression, structure-preserving token merging, and dynamic memory hierarchies to maintain performance while reducing computational costs.

AIBullisharXiv – CS AI · Jun 57/10

🧠

Dynamic Thinking-Token Selection for Efficient Reasoning in Large Reasoning Models

Researchers introduce Dynamic Thinking-Token Selection (DynTS), a method that optimizes Large Reasoning Models by identifying and retaining only decision-critical tokens during inference while discarding redundant reasoning trace data. This approach significantly reduces memory footprint and computational overhead, addressing a major efficiency bottleneck in LRMs that generate extended reasoning sequences.

AIBullisharXiv – CS AI · Jun 57/10

🧠

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

QCFuse introduces a compressed-view query-aware selector for retrieval-augmented generation (RAG) systems that accelerates LLM serving by intelligently reusing cached key-value computations. The technique achieves 1.7x speedup over full prefill and 1.5x over existing baselines while maintaining full-prefill quality, addressing a critical bottleneck in RAG deployment.

AIBullisharXiv – CS AI · Jun 57/10

🧠

Exact Linear Attention

Researchers introduce Exact Linear Attention (ELA), a novel Transformer mechanism that achieves linear computational complexity while eliminating approximation errors in attention calculations. The approach demonstrates significant practical improvements including 6x faster decoding speeds and 75% reduction in KV cache memory, with extensions to vision models showing 4.3x GPU speedup.

AIBullisharXiv – CS AI · Jun 27/10

🧠

Leyline: KV Cache Directives for Agentic Inference

Leyline introduces a new serving-side primitive for managing KV cache in agentic LLMs, enabling efficient content editing and removal without full re-computation. The system uses declarative directives and RoPE-rotation corrections to handle policy-driven cache modifications, improving cache efficiency by 11.2 percentage points and agent solve rates by 14.3 percentage points.

AIBullisharXiv – CS AI · Jun 27/10

🧠

Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs

Grokers introduces an architecture that shifts AI comprehension costs from query time to write time by using autonomous agents to pre-analyze and enrich typed knowledge graphs, eliminating repeated language model calls through inductive dependency traversal. The system proves three formal theorems about cache efficiency, interaction resolution, and correct traversal ordering while providing a deterministic alternative to embedding-based search.

AIBullisharXiv – CS AI · May 287/10

🧠

A Policy-Driven Runtime Layer for Agentic LLM Serving

Researchers propose a new runtime layer architecture for serving multi-agent LLM systems, positioned between application frameworks and inference engines. The approach enables unified policy management for cross-cutting concerns like caching and fairness, with CacheSage demonstrating 13-37% improvements in cache hit rates and 12-29% reductions in time-to-first-token latency.

AIBullisharXiv – CS AI · May 127/10

🧠

Key-Value Means

Researchers introduce Key-Value Means (KVM), a novel attention mechanism that bridges traditional transformers and linear RNNs by supporting both fixed-size and growing state with linear time complexity. The approach achieves competitive long-context performance while reducing KV-cache memory requirements and enabling flexible prefill time complexity between O(N) and O(N²).

🏢 Hugging Face

AIBullisharXiv – CS AI · May 117/10

🧠

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

Researchers introduce LaProx, a novel KV Cache eviction strategy for long-context LLM inference that reformulates the problem from head-wise weight averaging to output-aware layer-wise matrix multiplication. The method achieves 2× accuracy loss reduction under extreme compression while maintaining performance with just 5% of the original KV cache.

AIBullisharXiv – CS AI · May 117/10

🧠

CSR: Infinite-Horizon Real-Time Policies with Massive Cached State Representations

Researchers introduce Cached State Representation (CSR), a framework that reduces latency in deploying large language models for robotics by 26-fold through optimized token caching and asynchronous state management. The approach enables real-time robot control with massive language models while maintaining full contextual understanding over infinite operational horizons.

AIBullisharXiv – CS AI · May 77/10

🧠

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints

Researchers introduce a queueing-theoretic framework that models LLM inference stability by accounting for both computational and GPU memory constraints from KV caching. The framework derives conditions for service stability and enables operators to calculate optimal cluster sizes for efficient GPU provisioning, with experimental validation showing predictions within 10% accuracy.

AIBullisharXiv – CS AI · May 47/10

🧠

SAGA: Workflow-Atomic Scheduling for AI Agent Inference on GPU Clusters

SAGA is a new distributed GPU scheduler that treats entire AI agent workflows as atomic units rather than individual inference calls, reducing task completion time by 1.64x compared to existing solutions. The system achieves this through workflow-aware scheduling, KV cache optimization, and fairness mechanisms, though with a tradeoff of 30% lower peak throughput suitable for latency-sensitive interactive deployments.

🏢 Meta

AIBullisharXiv – CS AI · Apr 147/10

🧠

IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

IceCache is a new memory management technique for large language models that reduces KV cache memory consumption by 75% while maintaining 99% accuracy on long-sequence tasks. The method combines semantic token clustering with PagedAttention to intelligently offload cache data between GPU and CPU, addressing a critical bottleneck in LLM inference on resource-constrained hardware.

AIBullisharXiv – CS AI · Apr 137/10

🧠

CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference

Researchers introduce CSAttention, a training-free sparse attention method that accelerates LLM inference by 4.6x for long-context applications. The technique optimizes the offline-prefill/online-decode workflow by precomputing query-centric lookup tables, enabling faster token generation without sacrificing accuracy even at 95% sparsity levels.

AIBullisharXiv – CS AI · Mar 267/10

🧠

Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning

Researchers introduce Bottlenecked Transformers, a new architecture that improves AI reasoning by up to 6.6 percentage points through periodic memory consolidation inspired by brain processes. The system uses a Cache Processor to rewrite key-value cache entries at reasoning step boundaries, achieving better performance on math reasoning benchmarks compared to standard Transformers.

AIBullisharXiv – CS AI · Mar 177/10

🧠

RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse

Researchers introduce RelayCaching, a training-free method that accelerates multi-agent LLM systems by reusing KV cache data from previous agents to eliminate redundant computation. The technique achieves over 80% cache reuse and reduces time-to-first-token by up to 4.7x while maintaining accuracy across mathematical reasoning, knowledge tasks, and code generation.

AINeutralarXiv – CS AI · Mar 177/10

🧠

Accelerating Suffix Jailbreak attacks with Prefix-Shared KV-cache

Researchers developed Prefix-Shared KV Cache (PSKV), a new technique that accelerates jailbreak attacks on Large Language Models by 40% while reducing memory usage by 50%. The method optimizes the red-teaming process by sharing cached prefixes across multiple attack attempts, enabling more efficient parallel inference without compromising attack success rates.

AIBullisharXiv – CS AI · Mar 117/10

🧠

Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention

Researchers have developed Zipage, a new high-concurrency inference engine for large language models that uses Compressed PagedAttention to solve memory bottlenecks. The system achieves 95% performance of full KV inference engines while delivering over 2.1x speedup on mathematical reasoning tasks.

AIBullisharXiv – CS AI · Mar 117/10

🧠

ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

Researchers propose ARKV, a new framework for managing memory in large language models that reduces KV cache memory usage by 4x while preserving 97% of baseline accuracy. The adaptive system dynamically allocates precision levels to cached tokens based on attention patterns, enabling more efficient long-context inference without requiring model retraining.

AIBullisharXiv – CS AI · Mar 67/10

🧠

Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Researchers developed a memory management system for multi-agent AI systems on edge devices that reduces memory requirements by 4x through 4-bit quantization and eliminates redundant computation by persisting KV caches to disk. The solution reduces time-to-first-token by up to 136x while maintaining minimal impact on model quality across three major language model architectures.

🏢 Perplexity🧠 Llama

AIBullisharXiv – CS AI · Mar 37/103

🧠

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Researchers introduce FreeKV, a training-free optimization framework that dramatically improves KV cache retrieval efficiency for large language models with long context windows. The system achieves up to 13x speedup compared to existing methods while maintaining near-lossless accuracy through speculative retrieval and hybrid memory layouts.

$NEAR

AIBullisharXiv – CS AI · Mar 37/103

🧠

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Researchers propose TRIM-KV, a novel approach that learns token importance for memory-bounded LLM inference through lightweight retention gates, addressing the quadratic cost of self-attention and growing key-value cache issues. The method outperforms existing eviction baselines across multiple benchmarks and provides insights into LLM interpretability through learned retention scores.

AINeutralarXiv – CS AI · Jun 106/10

🧠

Blurry Window Attention

Researchers introduce Blurry Window Attention (BLA), a novel attention mechanism that addresses the quadratic complexity and memory limitations of traditional Transformer models by reconstructing sparse key-value history through Dirichlet kernel interpolation. BLA demonstrates 8x state efficiency improvements over sliding window attention while maintaining competitive performance on information retrieval tasks, positioning it as a viable alternative for long-context language modeling.

🏢 Perplexity

Page 1 of 2Next →