#cache-optimization News & Analysis

5 articles tagged with #cache-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

WhiFlash introduces a speculative decoding method that combines autoregressive and diffusion-based drafting models through token-level routing, achieving up to 69.6% higher throughput than existing approaches. The system uses lightweight controllers to switch dynamically between drafting paradigms based on per-token conditions, addressing a key bottleneck in LLM inference efficiency.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Leyline: KV Cache Directives for Agentic Inference

Leyline introduces a new serving-side primitive for managing KV cache in agentic LLMs, enabling efficient content editing and removal without full re-computation. The system uses declarative directives and RoPE-rotation corrections to handle policy-driven cache modifications, improving cache efficiency by 11.2 percentage points and agent solve rates by 14.3 percentage points.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Recency/Frequency Adaptive KV Caching for Large Language Model Serving

Researchers propose an adaptive key-value caching strategy for large language models that dynamically allocates cache space based on recency and frequency patterns, improving upon traditional LRU eviction policies. The approach demonstrates up to 10.8% improvement in cache hit rates and 12.6% reduction in time-to-first-token on synthetic workloads, with more modest gains on real-world conversation data.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching

Researchers propose Semantic Cache Distillation (SCD), a framework that reduces communication overhead in large language model inference by replacing raw key-value (KV) cache transmission with compact semantic codes. The method achieves up to a 2.65x speedup in time-to-first-token while keeping generation quality within 5% of baseline performance, addressing a critical bottleneck in disaggregated LLM serving architectures.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Hybrid Verified Decoding: Learning to Allocate Verification in Speculative Decoding

Researchers propose Hybrid Verified Decoding, a technique that speeds up LLM inference by choosing between cache-based and model-based token drafting methods. The approach predicts draft acceptance rates before verification, achieving a 2.73x average speedup on agentic workflows and outperforming existing methods such as EAGLE3.