y0news

#kv-cache News & Analysis

18 articles tagged with #kv-cache. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

IceCache: Memory-efficient KV-cache Management for Long-Sequence LLMs

IceCache is a new memory management technique for large language models that reduces KV cache memory consumption by 75% while maintaining 99% accuracy on long-sequence tasks. The method combines semantic token clustering with PagedAttention to intelligently offload cache data between GPU and CPU, addressing a critical bottleneck in LLM inference on resource-constrained hardware.
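
As a rough illustration of the hot/cold split behind such offloading (the function name, the attention-mass heuristic, and the data below are invented for this sketch — IceCache's actual semantic clustering and PagedAttention paging are not reproduced here):

```python
import numpy as np

def partition_kv(attn_scores: np.ndarray, gpu_budget: int):
    """Split cached token indices into a hot set kept on GPU and a
    cold set offloaded to CPU, ranked by attention mass.
    Hypothetical stand-in for IceCache's semantic clustering."""
    order = np.argsort(attn_scores)[::-1]   # most-attended tokens first
    hot = np.sort(order[:gpu_budget])       # stay resident on the GPU
    cold = np.sort(order[gpu_budget:])      # offload to CPU memory
    return hot, cold

scores = np.array([0.40, 0.05, 0.30, 0.02, 0.20, 0.03])
hot, cold = partition_kv(scores, gpu_budget=3)  # hot: [0, 2, 4]
```

With a budget of 3 slots, the three most-attended tokens stay on the GPU and the rest are paged out — the 75% memory saving in the paper corresponds to choosing a small budget while keeping the tokens that matter.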

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

CSAttention: Centroid-Scoring Attention for Accelerating LLM Inference

Researchers introduce CSAttention, a training-free sparse attention method that accelerates LLM inference by 4.6x for long-context applications. The technique optimizes the offline-prefill/online-decode workflow by precomputing query-centric lookup tables, enabling faster token generation without sacrificing accuracy even at 95% sparsity levels.
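
The centroid-scoring idea can be shown with a toy example (all data and names below are made up; the paper's precomputed query-centric lookup tables are reduced here to a single dot product per centroid):

```python
import numpy as np

def centroid_select(q, centroid_ids, centroids, top_c=1):
    """Score the query against cluster centroids instead of every key,
    then attend only to keys belonging to the best-scoring clusters.
    Simplified sketch of centroid-scoring sparse attention."""
    scores = centroids @ q                    # one dot product per centroid
    best = np.argsort(scores)[::-1][:top_c]   # highest-scoring clusters
    mask = np.isin(centroid_ids, best)        # keys in those clusters
    return np.nonzero(mask)[0]                # indices to attend over

centroid_ids = np.array([0, 0, 1, 1])         # which cluster each key is in
centroids = np.array([[0.95, 0.05], [0.05, 0.95]])
q = np.array([1.0, 0.0])
sel = centroid_select(q, centroid_ids, centroids)  # keys 0 and 1
```

Scoring two centroids instead of four keys is the whole trick: at 95% sparsity the full key set is large but the centroid set stays small, so decode-time cost collapses.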

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

Bottlenecked Transformers: Periodic KV Cache Consolidation for Generalised Reasoning

Researchers introduce Bottlenecked Transformers, a new architecture that improves AI reasoning by up to 6.6 percentage points through periodic memory consolidation inspired by brain processes. The system uses a Cache Processor to rewrite key-value cache entries at reasoning step boundaries, achieving better performance on math reasoning benchmarks compared to standard Transformers.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

RelayCaching: Accelerating LLM Collaboration via Decoding KV Cache Reuse

Researchers introduce RelayCaching, a training-free method that accelerates multi-agent LLM systems by reusing KV cache data from previous agents to eliminate redundant computation. The technique achieves over 80% cache reuse and reduces time-to-first-token by up to 4.7x while maintaining accuracy across mathematical reasoning, knowledge tasks, and code generation.
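
A toy version of this relay makes the mechanism concrete (the class and its structure are invented for illustration; the paper's actual decode-cache handoff is more involved):

```python
import hashlib

class RelayCache:
    """Toy KV-cache relay between agents: if a later agent's prompt
    starts with a prefix an earlier agent already processed, reuse
    that prefix's cached entry instead of recomputing it."""
    def __init__(self):
        self.store = {}   # prefix hash -> cached "KV" payload
        self.hits = 0

    def _key(self, tokens):
        return hashlib.sha1(" ".join(tokens).encode()).hexdigest()

    def put(self, tokens, kv):
        self.store[self._key(tokens)] = kv

    def get_longest_prefix(self, tokens):
        for n in range(len(tokens), 0, -1):   # longest cached prefix wins
            k = self._key(tokens[:n])
            if k in self.store:
                self.hits += 1
                return n, self.store[k]       # reuse: skip n prefill tokens
        return 0, None

cache = RelayCache()
cache.put(["system:", "solve", "math"], kv="agent1-kv")
n, kv = cache.get_longest_prefix(["system:", "solve", "math", "step2"])
```

Here the second agent skips recomputing three of its four tokens — the >80% reuse rate in the paper comes from multi-agent pipelines sharing long common prefixes like this.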

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

Accelerating Suffix Jailbreak attacks with Prefix-Shared KV-cache

Researchers developed Prefix-Shared KV Cache (PSKV), a new technique that accelerates jailbreak attacks on Large Language Models by 40% while reducing memory usage by 50%. The method optimizes the red-teaming process by sharing cached prefixes across multiple attack attempts, enabling more efficient parallel inference without compromising attack success rates.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

Researchers propose ARKV, a new framework for managing memory in large language models that reduces KV cache memory usage by 4x while preserving 97% of baseline accuracy. The adaptive system dynamically allocates precision levels to cached tokens based on attention patterns, enabling more efficient long-context inference without requiring model retraining.

AI · Bullish · arXiv – CS AI · Mar 6 · 7/10

Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

Researchers developed a memory management system for multi-agent AI systems on edge devices that reduces memory requirements by 4x through 4-bit quantization and eliminates redundant computation by persisting KV caches to disk. The solution reduces time-to-first-token by up to 136x while maintaining minimal impact on model quality across three major language model architectures.
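
The two ingredients — 4-bit quantization and disk persistence — can be sketched together (the symmetric per-tensor scheme below is a simplification; the paper's exact Q4 format and file layout are not specified in this summary):

```python
import os
import tempfile
import numpy as np

def quantize_q4(x):
    """Symmetric 4-bit quantization: map values to integers in [-7, 7]
    plus one per-tensor scale. Simplified stand-in for a Q4 KV format."""
    scale = np.abs(x).max() / 7.0
    q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_q4(q, scale):
    return q.astype(np.float32) * scale

kv = np.array([0.7, -0.35, 0.1, 0.0], dtype=np.float32)
q, s = quantize_q4(kv)
path = os.path.join(tempfile.mkdtemp(), "agent_kv.npz")
np.savez(path, q=q, scale=s)        # persist for the next agent's turn
loaded = np.load(path)
restored = dequantize_q4(loaded["q"], loaded["scale"])
```

The next agent reloads the quantized cache instead of re-prefilling its context — that skipped prefill, not the quantization itself, is where the reported 136x time-to-first-token gain comes from.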

๐Ÿข Perplexity๐Ÿง  Llama
AIBullisharXiv โ€“ CS AI ยท Mar 37/103
๐Ÿง 

FreeKV: Boosting KV Cache Retrieval for Efficient LLM Inference

Researchers introduce FreeKV, a training-free optimization framework that dramatically improves KV cache retrieval efficiency for large language models with long context windows. The system achieves up to 13x speedup compared to existing methods while maintaining near-lossless accuracy through speculative retrieval and hybrid memory layouts.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Researchers propose TRIM-KV, a novel approach that learns token importance for memory-bounded LLM inference through lightweight retention gates, addressing the quadratic cost of self-attention and the steady growth of the key-value cache. The method outperforms existing eviction baselines across multiple benchmarks, and its learned retention scores offer a window into LLM interpretability.
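
Gate-based eviction reduces to a few lines once the gates are trained (the logits below are invented; in TRIM-KV they would come from learned per-token gates, which this sketch does not train):

```python
import numpy as np

def evict_by_retention(gate_logits, budget):
    """Keep only the `budget` tokens with the highest learned retention
    scores (sigmoid of a per-token gate logit); evict the rest.
    Illustrative sketch of retention-gate eviction."""
    scores = 1.0 / (1.0 + np.exp(-gate_logits))        # sigmoid gate
    keep = np.sort(np.argsort(scores)[::-1][:budget])  # top-`budget` tokens
    return keep, scores

logits = np.array([2.0, -1.0, 0.5, -3.0, 1.0])   # hypothetical gate outputs
keep, scores = evict_by_retention(logits, budget=3)   # keeps tokens 0, 2, 4
```

The difference from attention-heuristic eviction is that the scores are learned end-to-end, so they can encode importance that raw attention weights miss.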

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys

Researchers propose a novel self-indexing KV cache system that unifies compression and retrieval for efficient sparse attention in large language models. The method uses 1-bit vector quantization and integrates with FlashAttention to reduce memory bottlenecks in long-context LLM inference.
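
The "self-indexing" part — compressed keys doubling as the retrieval index — can be sketched with sign bits (data and scoring below are made up; the paper's 1-bit codebooks and FlashAttention integration are not reproduced):

```python
import numpy as np

def sign_compress(keys):
    """1-bit quantization: keep only the sign of each key dimension."""
    return (keys > 0).astype(np.uint8)

def retrieve(q, key_bits, top_k=2):
    """Rank cached keys by sign agreement with the query; the compressed
    keys themselves serve as the index, so no separate retrieval
    structure is needed."""
    q_bits = (q > 0).astype(np.uint8)
    agreement = (key_bits == q_bits).sum(axis=1)  # matching bits per key
    return np.sort(np.argsort(agreement)[::-1][:top_k])

keys = np.array([[0.9, -0.2, 0.4], [-0.5, 0.1, -0.8], [0.3, -0.7, 0.2]])
bits = sign_compress(keys)                    # 1 bit per dimension
sel = retrieve(np.array([1.0, -1.0, 1.0]), bits)  # keys 0 and 2 match best
```

Because the same bit vectors serve both compression and lookup, there is no extra index to store — which is exactly the memory bottleneck the paper targets.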

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10

LookaheadKV: Fast and Accurate KV Cache Eviction by Glimpsing into the Future without Generation

Researchers have developed LookaheadKV, a new framework that significantly improves memory efficiency in large language models by intelligently evicting less important cached data. The method achieves superior accuracy while reducing computational costs by up to 14.5x compared to existing approaches, making long-context AI tasks more practical.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4

OrbitFlow: SLO-Aware Long-Context LLM Serving with Fine-Grained KV Cache Reconfiguration

OrbitFlow is a new KV cache management system for long-context LLM serving that uses adaptive memory allocation and fine-grained optimization to improve performance. The system achieves up to 66% better SLO attainment and 3.3x higher throughput by dynamically managing GPU memory usage during token generation.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3

PiKV: KV Cache Management System for Mixture of Experts

Researchers have introduced PiKV, an open-source KV cache management framework designed to optimize memory and communication costs for Mixture of Experts (MoE) language models across multi-GPU and multi-node inference. The system uses expert-sharded storage, intelligent routing, adaptive scheduling, and compression to improve efficiency in large-scale AI model deployment.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 11

KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning

Researchers from PKU-SEC-Lab have developed KEEP, a new memory management system that significantly improves the efficiency of AI-powered embodied planning by optimizing KV cache usage. The system achieves 2.68x speedup compared to text-based memory methods while maintaining accuracy, addressing a key bottleneck in memory-augmented Large Language Models for complex planning tasks.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6

SideQuest: Model-Driven KV Cache Management for Long-Horizon Agentic Reasoning

Researchers introduce SideQuest, a novel KV cache management system that uses Large Reasoning Models to compress memory usage during long-horizon AI tasks. The system reduces peak token usage by up to 65% while maintaining accuracy by having the model itself determine which tokens are useful to keep in memory.

AI · Neutral · Hugging Face Blog · Jun 4 · 4/10 · 8

KV Cache from scratch in nanoVLM

The article walks through implementing a KV (key-value) cache from scratch in nanoVLM, a lightweight vision-language model framework, focusing on reducing memory usage and speeding up inference for multimodal AI applications.