AI · Bullish · arXiv – CS AI · 6h ago · 7/10
IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference
Researchers introduce IntentKV, a learned KV cache pruning technique that reduces memory use for multi-turn LLM agents without modifying the base model. The method cuts peak request tokens by 23-30% and KV reads by up to 92.6% under tight memory budgets, addressing a critical bottleneck in long-horizon agent inference.