18 articles tagged with #kv-cache. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠 IceCache is a new memory management technique for large language models that reduces KV cache memory consumption by 75% while maintaining 99% accuracy on long-sequence tasks. The method combines semantic token clustering with PagedAttention to intelligently offload cache data between GPU and CPU, addressing a critical bottleneck in LLM inference on resource-constrained hardware.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠 Researchers introduce CSAttention, a training-free sparse attention method that accelerates LLM inference by 4.6x for long-context applications. The technique optimizes the offline-prefill/online-decode workflow by precomputing query-centric lookup tables, enabling faster token generation without sacrificing accuracy even at 95% sparsity levels.
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠 Researchers introduce Bottlenecked Transformers, a new architecture that improves AI reasoning by up to 6.6 percentage points through periodic memory consolidation inspired by brain processes. The system uses a Cache Processor to rewrite key-value cache entries at reasoning step boundaries, achieving better performance on math reasoning benchmarks compared to standard Transformers.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers introduce RelayCaching, a training-free method that accelerates multi-agent LLM systems by reusing KV cache data from previous agents to eliminate redundant computation. The technique achieves over 80% cache reuse and reduces time-to-first-token by up to 4.7x while maintaining accuracy across mathematical reasoning, knowledge tasks, and code generation.
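The core idea of prefix reuse can be sketched in a few lines: when one agent's prompt begins with tokens another agent already processed, the cached KV blocks for that shared prefix can be looked up instead of recomputed, so only the new suffix needs prefill. This is a hypothetical toy sketch of prefix sharing, not the RelayCaching implementation; the class and its methods are assumptions.

```python
import hashlib

class PrefixKVCache:
    """Toy prefix-reuse cache mapping a token prefix to its (mock) KV data.

    A minimal sketch of KV prefix sharing across agents; names and
    structure are illustrative assumptions, not the paper's design.
    """

    def __init__(self):
        self._store = {}  # prefix hash -> kv blob

    @staticmethod
    def _key(tokens):
        # Space-join token ids so [1, 23] and [12, 3] hash differently.
        return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

    def put(self, tokens, kv_blob):
        self._store[self._key(tokens)] = kv_blob

    def longest_prefix(self, tokens):
        # Scan from the full sequence down to length 1 and return the
        # longest cached prefix, so only the suffix needs fresh prefill.
        for end in range(len(tokens), 0, -1):
            hit = self._store.get(self._key(tokens[:end]))
            if hit is not None:
                return end, hit
        return 0, None

cache = PrefixKVCache()
shared_prompt = [101, 7, 7, 42]            # tokens agent 1 already prefilled
cache.put(shared_prompt, "kv-for-prefix")  # mock stand-in for real KV tensors

reused, kv = cache.longest_prefix(shared_prompt + [9, 9])
# reused == 4: only the two new tokens need prefill
```

A production system would key on paged KV blocks rather than hashing whole prefixes, but the lookup-longest-prefix pattern is the same.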
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠 Researchers developed Prefix-Shared KV Cache (PSKV), a new technique that accelerates jailbreak attacks on Large Language Models by 40% while reducing memory usage by 50%. The method optimizes the red-teaming process by sharing cached prefixes across multiple attack attempts, enabling more efficient parallel inference without compromising attack success rates.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠 Researchers propose ARKV, a new framework for managing memory in large language models that reduces KV cache memory usage by 4x while preserving 97% of baseline accuracy. The adaptive system dynamically allocates precision levels to cached tokens based on attention patterns, enabling more efficient long-context inference without requiring model retraining.
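Attention-aware precision allocation can be illustrated with a small NumPy sketch: tokens that attract the most attention mass keep a higher bit width, while the rest are quantized more aggressively. The bit widths, the 25% high-precision fraction, and the function names below are assumptions for illustration, not ARKV's actual scheme.

```python
import numpy as np

def quantize(x, bits):
    """Uniform symmetric fake-quantization of a vector to `bits` bits."""
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x))
    if scale == 0:
        return x.copy()
    step = scale / levels
    return np.round(x / step).clip(-levels, levels) * step

def adaptive_kv_quant(keys, attn_mass, hi_bits=8, lo_bits=4, keep_frac=0.25):
    """Keep the top `keep_frac` of tokens (by attention mass) at hi_bits,
    quantize the rest to lo_bits. A toy stand-in for attention-aware
    precision allocation; thresholds and bit widths are assumptions."""
    n_hi = max(1, int(len(attn_mass) * keep_frac))
    hi = set(np.argsort(attn_mass)[-n_hi:].tolist())
    return np.stack([
        quantize(k, hi_bits if i in hi else lo_bits)
        for i, k in enumerate(keys)
    ])

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))   # 8 cached key vectors of dimension 4
attn = rng.random(8)             # mock per-token attention mass
qkeys = adaptive_kv_quant(keys, attn)
```

Real systems store the quantized integers plus scales rather than dequantized floats, but the allocation policy is the interesting part here.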
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠 Researchers have developed Zipage, a new high-concurrency inference engine for large language models that uses Compressed PagedAttention to solve memory bottlenecks. The system achieves 95% of the performance of full-KV inference engines while delivering over a 2.1x speedup on mathematical reasoning tasks.
AI · Bullish · arXiv – CS AI · Mar 6 · 7/10
🧠 Researchers developed a memory management system for multi-agent AI systems on edge devices that reduces memory requirements by 4x through 4-bit quantization and eliminates redundant computation by persisting KV caches to disk. The solution reduces time-to-first-token by up to 136x while maintaining minimal impact on model quality across three major language model architectures.
🟢 Perplexity · 🧠 Llama
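The disk-persistence idea above has a simple shape: before running prefill for a shared prefix, check whether a previous agent already saved that prefix's KV cache to disk, and reload it on a hit. The sketch below uses NumPy arrays as mock KV tensors; the function name and file layout are assumptions, and real engines also persist quantized blocks plus metadata.

```python
import os
import tempfile

import numpy as np

def load_or_prefill(path, prefill_fn):
    """Reload a persisted KV cache from disk if present, otherwise run
    prefill and save the result. Returns (kv, cache_hit). A minimal
    sketch of disk-persisted KV caches, not the paper's system."""
    if os.path.exists(path):
        return np.load(path), True   # cache hit: prefill skipped entirely
    kv = prefill_fn()
    np.save(path, kv)
    return kv, False

path = os.path.join(tempfile.mkdtemp(), "agent_prefix.npy")

# First agent pays the prefill cost and persists the result.
kv1, hit1 = load_or_prefill(path, lambda: np.ones((2, 4), dtype=np.float16))
# Second agent reloads from disk; its prefill lambda never runs.
kv2, hit2 = load_or_prefill(path, lambda: np.zeros((2, 4), dtype=np.float16))
```

The large time-to-first-token wins come from exactly this substitution: one disk read in place of a full forward pass over the shared prompt.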
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠 Researchers introduce FreeKV, a training-free optimization framework that dramatically improves KV cache retrieval efficiency for large language models with long context windows. The system achieves up to 13x speedup compared to existing methods while maintaining near-lossless accuracy through speculative retrieval and hybrid memory layouts.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠 Researchers propose TRIM-KV, a novel approach that learns token importance for memory-bounded LLM inference through lightweight retention gates, addressing the quadratic cost of self-attention and growing key-value cache issues. The method outperforms existing eviction baselines across multiple benchmarks and provides insights into LLM interpretability through learned retention scores.
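Score-based eviction is the common skeleton behind methods like this: each cached token carries an importance score, and when the cache exceeds its budget, the lowest-scoring entries are dropped. The sketch below uses fixed scores for illustration; TRIM-KV learns its retention scores with gates, which this generic loop does not model.

```python
import heapq

def evict_to_budget(cache, budget):
    """Drop the lowest-scoring entries until the cache fits the budget.

    `cache` maps token position -> (retention_score, kv). A generic
    score-based eviction sketch, not TRIM-KV's learned gating.
    """
    if len(cache) <= budget:
        return dict(cache)
    keep = heapq.nlargest(budget, cache.items(), key=lambda item: item[1][0])
    return dict(keep)

# Four cached tokens with mock retention scores and mock KV payloads.
cache = {0: (0.9, "kv0"), 1: (0.1, "kv1"), 2: (0.7, "kv2"), 3: (0.2, "kv3")}
pruned = evict_to_budget(cache, budget=2)
# keeps positions 0 and 2, the highest-scoring tokens
```

The interesting design question is where the scores come from; heuristics use accumulated attention, while learned approaches train a small gate per token.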
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠 Researchers propose a novel self-indexing KV cache system that unifies compression and retrieval for efficient sparse attention in large language models. The method uses 1-bit vector quantization and integrates with FlashAttention to reduce memory bottlenecks in long-context LLM inference.
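1-bit vector quantization in this setting typically means storing only the sign of each key component plus a per-vector scale, so approximate attention scores can be computed cheaply to screen candidates before full-precision attention. The sketch below is a textbook version of binary key codes under that assumption, not the paper's exact scheme.

```python
import numpy as np

def binarize(keys):
    """1-bit quantization: keep the sign of each component plus one
    per-vector scale (the mean magnitude). A generic sketch of binary
    key codes for cheap candidate screening."""
    signs = np.sign(keys)
    signs[signs == 0] = 1.0  # break ties so codes are strictly +/-1
    scales = np.mean(np.abs(keys), axis=-1, keepdims=True)
    return signs, scales

def approx_scores(query, signs, scales):
    # Approximate q . k with the binary code: q . (scale * sign(k)).
    return (signs * scales) @ query

rng = np.random.default_rng(1)
keys = rng.normal(size=(16, 8))   # 16 cached keys of dimension 8
q = rng.normal(size=8)

signs, scales = binarize(keys)    # 1 bit/component + 1 float/vector
approx = approx_scores(q, signs, scales)
exact = keys @ q
```

Storage drops from one float per component to one bit, and the approximate scores are used only to rank candidates, so small errors are tolerable.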
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠 Researchers propose OxyGen, a unified KV cache management system for Vision-Language-Action Models that enables efficient multi-task parallelism in embodied AI agents. The system achieves up to 3.7x speedup by sharing computational resources across tasks and eliminating redundant processing of shared observations.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠 Researchers have developed LookaheadKV, a new framework that significantly improves memory efficiency in large language models by intelligently evicting less important cached data. The method achieves superior accuracy while reducing computational costs by up to 14.5x compared to existing approaches, making long-context AI tasks more practical.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠 OrbitFlow is a new KV cache management system for long-context LLM serving that uses adaptive memory allocation and fine-grained optimization to improve performance. The system achieves up to 66% better SLO attainment and 3.3x higher throughput by dynamically managing GPU memory usage during token generation.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3
🧠 Researchers have introduced PiKV, an open-source KV cache management framework designed to optimize memory and communication costs for Mixture of Experts (MoE) language models across multi-GPU and multi-node inference. The system uses expert-sharded storage, intelligent routing, adaptive scheduling, and compression to improve efficiency in large-scale AI model deployment.
AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 11
🧠 Researchers from PKU-SEC-Lab have developed KEEP, a new memory management system that significantly improves the efficiency of AI-powered embodied planning by optimizing KV cache usage. The system achieves a 2.68x speedup compared to text-based memory methods while maintaining accuracy, addressing a key bottleneck in memory-augmented Large Language Models for complex planning tasks.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6
🧠 Researchers introduce SideQuest, a novel KV cache management system that uses Large Reasoning Models to compress memory usage during long-horizon AI tasks. The system reduces peak token usage by up to 65% while maintaining accuracy by having the model itself determine which tokens are useful to keep in memory.
AI · Neutral · Hugging Face Blog · Jun 4 · 4/10 · 8
🧠 The article walks through the implementation of a KV (key-value) cache in nanoVLM, a lightweight vision-language model framework. The implementation focuses on optimizing memory usage and inference speed for multimodal AI applications.