AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce Key-Value Means (KVM), a novel attention mechanism that bridges traditional transformers and linear RNNs by supporting both fixed-size and growing state with linear time complexity. The approach achieves competitive long-context performance while reducing KV-cache memory requirements and enabling flexible prefill time complexity between O(N) and O(N²).
🏢 Hugging Face
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce LaProx, a novel KV Cache eviction strategy for long-context LLM inference that reformulates the problem from head-wise weight averaging to output-aware layer-wise matrix multiplication. The method achieves 2× accuracy loss reduction under extreme compression while maintaining performance with just 5% of the original KV cache.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce Cached State Representation (CSR), a framework that reduces latency in deploying large language models for robotics by 26-fold through optimized token caching and asynchronous state management. The approach enables real-time robot control with massive language models while maintaining full contextual understanding over infinite operational horizons.
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠Researchers introduce a queueing-theoretic framework that models LLM inference stability by accounting for both computational and GPU memory constraints from KV caching. The framework derives conditions for service stability and enables operators to calculate optimal cluster sizes for efficient GPU provisioning, with experimental validation showing predictions accurate to within 10%.
AI · Bullish · arXiv – CS AI · May 4 · 7/10
🧠SAGA is a new distributed GPU scheduler that treats entire AI agent workflows as atomic units rather than individual inference calls, reducing task completion time by 1.64x compared to existing solutions. The system achieves this through workflow-aware scheduling, KV cache optimization, and fairness mechanisms, at the cost of 30% lower peak throughput, a tradeoff suited to latency-sensitive interactive deployments.
🏢 Meta
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠IceCache is a new memory management technique for large language models that reduces KV cache memory consumption by 75% while maintaining 99% accuracy on long-sequence tasks. The method combines semantic token clustering with PagedAttention to intelligently offload cache data between GPU and CPU, addressing a critical bottleneck in LLM inference on resource-constrained hardware.
AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers introduce CSAttention, a training-free sparse attention method that accelerates LLM inference by 4.6x for long-context applications. The technique optimizes the offline-prefill/online-decode workflow by precomputing query-centric lookup tables, enabling faster token generation without sacrificing accuracy even at 95% sparsity levels.
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers introduce Bottlenecked Transformers, a new architecture that improves AI reasoning by up to 6.6 percentage points through periodic memory consolidation inspired by brain processes. The system uses a Cache Processor to rewrite key-value cache entries at reasoning step boundaries, achieving better performance on math reasoning benchmarks compared to standard Transformers.
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce RelayCaching, a training-free method that accelerates multi-agent LLM systems by reusing KV cache data from previous agents to eliminate redundant computation. The technique achieves over 80% cache reuse and reduces time-to-first-token by up to 4.7x while maintaining accuracy across mathematical reasoning, knowledge tasks, and code generation.
AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers developed Prefix-Shared KV Cache (PSKV), a new technique that accelerates jailbreak attacks on Large Language Models by 40% while reducing memory usage by 50%. The method optimizes the red-teaming process by sharing cached prefixes across multiple attack attempts, enabling more efficient parallel inference without compromising attack success rates.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers propose ARKV, a new framework for managing memory in large language models that reduces KV cache memory usage by 4x while preserving 97% of baseline accuracy. The adaptive system dynamically allocates precision levels to cached tokens based on attention patterns, enabling more efficient long-context inference without requiring model retraining.
AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers have developed Zipage, a new high-concurrency inference engine for large language models that uses Compressed PagedAttention to solve memory bottlenecks. The system achieves 95% of the performance of full-KV inference engines while delivering over 2.1x speedup on mathematical reasoning tasks.
AI · Bullish · arXiv – CS AI · Mar 6 · 7/10
🧠Researchers developed a memory management system for multi-agent AI systems on edge devices that reduces memory requirements by 4x through 4-bit quantization and eliminates redundant computation by persisting KV caches to disk. The solution reduces time-to-first-token by up to 136x while maintaining minimal impact on model quality across three major language model architectures.
🏢 Perplexity · 🧠 Llama
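The 4x figure above matches simple arithmetic if the baseline cache is stored in 16-bit values, which is an assumption here and not stated in the summary. A minimal sketch of that arithmetic follows; the function name and the model shape (32 layers, 8 KV heads, head dimension 128, 8192 tokens) are hypothetical, and the estimate ignores the per-group scales and zero-points that real quantization schemes also store.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV cache size in bytes for a single sequence.

    Counts one key vector and one value vector per token, per KV head,
    per layer. Quantization metadata (scales, zero-points) is ignored.
    """
    num_values = 2 * num_layers * num_kv_heads * head_dim * seq_len  # 2 = keys + values
    return num_values * bits_per_value // 8


# Hypothetical model shape, chosen only for illustration.
fp16_bytes = kv_cache_bytes(32, 8, 128, 8192, bits_per_value=16)
int4_bytes = kv_cache_bytes(32, 8, 128, 8192, bits_per_value=4)
ratio = fp16_bytes / int4_bytes

print(f"16-bit cache: {fp16_bytes / 2**20:.0f} MiB")
print(f"4-bit cache:  {int4_bytes / 2**20:.0f} MiB")
print(f"reduction:    {ratio:.1f}x")
```

Under these assumptions the cache shrinks from 1024 MiB to 256 MiB per sequence; actual savings would be somewhat lower once quantization metadata is included.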
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers introduce FreeKV, a training-free optimization framework that dramatically improves KV cache retrieval efficiency for large language models with long context windows. The system achieves up to 13x speedup compared to existing methods while maintaining near-lossless accuracy through speculative retrieval and hybrid memory layouts.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠Researchers propose TRIM-KV, a novel approach that learns token importance for memory-bounded LLM inference through lightweight retention gates, addressing both the quadratic cost of self-attention and the growing key-value cache. The method outperforms existing eviction baselines across multiple benchmarks and provides insights into LLM interpretability through learned retention scores.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers propose Shadow Mask Distillation to address the memory bottleneck created by KV cache compression during reinforcement learning post-training of large language models. The technique tackles the critical off-policy bias that emerges when compressed contexts are used during rollout generation while full contexts are used for parameter updates, a problem that amplifies instability in RL optimization.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Fluxion, a new hybrid CPU-GPU system, optimizes long-context inference by efficiently managing key-value caches split between host and GPU memory. The approach delivers 1.5x-3.7x speedup over existing baselines while maintaining near-baseline accuracy, addressing a critical bottleneck in modern large language model deployment.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose a novel self-indexing KV cache system that unifies compression and retrieval for efficient sparse attention in large language models. The method uses 1-bit vector quantization and integrates with FlashAttention to reduce memory bottlenecks in long-context LLM inference.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose OxyGen, a unified KV cache management system for Vision-Language-Action Models that enables efficient multi-task parallelism in embodied AI agents. The system achieves up to 3.7x speedup by sharing computational resources across tasks and eliminating redundant processing of shared observations.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers have developed LookaheadKV, a new framework that significantly improves memory efficiency in large language models by intelligently evicting less important cached data. The method achieves superior accuracy while reducing computational costs by up to 14.5x compared to existing approaches, making long-context AI tasks more practical.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4
🧠OrbitFlow is a new KV cache management system for long-context LLM serving that uses adaptive memory allocation and fine-grained optimization to improve performance. The system achieves up to 66% better SLO attainment and 3.3x higher throughput by dynamically managing GPU memory usage during token generation.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3
🧠Researchers have introduced PiKV, an open-source KV cache management framework designed to optimize memory and communication costs for Mixture of Experts (MoE) language models across multi-GPU and multi-node inference. The system uses expert-sharded storage, intelligent routing, adaptive scheduling, and compression to improve efficiency in large-scale AI model deployment.
AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 11
🧠Researchers from PKU-SEC-Lab have developed KEEP, a new memory management system that significantly improves the efficiency of AI-powered embodied planning by optimizing KV cache usage. The system achieves 2.68x speedup compared to text-based memory methods while maintaining accuracy, addressing a key bottleneck in memory-augmented Large Language Models for complex planning tasks.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6
🧠Researchers introduce SideQuest, a novel KV cache management system that uses Large Reasoning Models to compress memory usage during long-horizon AI tasks. The system reduces peak token usage by up to 65% while maintaining accuracy by having the model itself determine which tokens are useful to keep in memory.
AI · Neutral · Hugging Face Blog · Jun 4 · 4/10 · 8
🧠The article discusses the implementation of KV (Key-Value) cache mechanisms in nanoVLM, a lightweight vision-language model framework. This technical implementation focuses on optimizing memory usage and inference speed for multimodal AI applications.
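For readers unfamiliar with the mechanism, the idea behind a KV cache can be sketched in a few lines. This is a generic, single-head toy in pure Python, not nanoVLM's implementation; the `KVCache` class, the `attend` helper, and the identity projections (query = key = value = token embedding) are all assumptions made for illustration.

```python
import math


def attend(q, keys, values):
    """Softmax attention of one query vector over the given keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


class KVCache:
    """Stores keys and values of past tokens so each decode step only appends one pair."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


# Toy sequence with identity projections (hypothetical): q = k = v = embedding.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Incremental decoding: one new key/value pair per step, past pairs reused.
cache = KVCache()
cached_out = [cache.step(t, t, t) for t in tokens]

# Reference: rebuild the whole causal prefix at every step.
full_out = [attend(tokens[i], tokens[: i + 1], tokens[: i + 1]) for i in range(len(tokens))]

matches = all(
    abs(a - b) < 1e-12
    for row_c, row_f in zip(cached_out, full_out)
    for a, b in zip(row_c, row_f)
)
print("cached decoding matches full recomputation:", matches)
```

The cached path produces the same outputs as recomputing from the full prefix. In a real model the saving comes from not re-running the key and value projections for past tokens, at the price of memory that grows with sequence length.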