AIBullish · arXiv CS AI · 6d ago
🧠
Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Researchers propose TRIM-KV, an approach that learns per-token importance for memory-bounded LLM inference through lightweight retention gates, addressing the quadratic cost of self-attention and the growth of the key-value (KV) cache with context length. The method outperforms existing eviction baselines across multiple benchmarks, and the learned retention scores offer a window into LLM interpretability.
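The paper itself is not excerpted here, but to make the general idea concrete, here is a minimal sketch of budget-driven KV cache eviction guided by a learned per-token retention score. All names (RetentionGate, BoundedKVCache, the budget of 128 tokens) are hypothetical illustrations, not TRIM-KV's actual design or training procedure:

```python
# Hypothetical sketch: a KV cache that scores each cached token with a
# lightweight gate and evicts the lowest-scoring entries once a fixed
# memory budget is exceeded. Not the paper's implementation.
import torch
import torch.nn as nn


class RetentionGate(nn.Module):
    """Lightweight gate mapping a token's key vector to a retention score."""

    def __init__(self, head_dim: int):
        super().__init__()
        self.proj = nn.Linear(head_dim, 1)

    def forward(self, keys: torch.Tensor) -> torch.Tensor:
        # keys: (cache_len, head_dim) -> scores in (0, 1): (cache_len,)
        return torch.sigmoid(self.proj(keys)).squeeze(-1)


class BoundedKVCache:
    """Keeps at most `budget` tokens, dropping the least-retained one on overflow."""

    def __init__(self, budget: int, gate: RetentionGate):
        self.budget = budget
        self.gate = gate
        self.keys: list[torch.Tensor] = []
        self.values: list[torch.Tensor] = []

    def append(self, key: torch.Tensor, value: torch.Tensor) -> None:
        self.keys.append(key)
        self.values.append(value)
        if len(self.keys) > self.budget:
            self._evict()

    def _evict(self) -> None:
        # Score all cached keys and keep only the top-`budget` tokens,
        # preserving their original order.
        with torch.no_grad():
            scores = self.gate(torch.stack(self.keys))  # (cache_len,)
        keep = scores.topk(self.budget).indices.sort().values.tolist()
        self.keys = [self.keys[i] for i in keep]
        self.values = [self.values[i] for i in keep]


if __name__ == "__main__":
    head_dim = 64
    cache = BoundedKVCache(budget=128, gate=RetentionGate(head_dim))
    for _ in range(512):  # simulate 512 decoding steps under a 128-token budget
        k, v = torch.randn(head_dim), torch.randn(head_dim)
        cache.append(k, v)
    print(len(cache.keys))  # -> 128
```

The toy loop simulates 512 decoding steps under a 128-token budget; TRIM-KV's actual gating and how it is trained will differ, so treat this only as an outline of retention-scored eviction.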