AI · Bullish · arXiv – CS AI · Mar 6 · 7/10
🧠
Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
Researchers propose asymmetric transformer attention in which keys use fewer dimensions than queries and values, cutting the key cache by 75% with minimal quality loss. For 7B-parameter models, the technique saves 25GB of KV cache per user, enough to serve 60% more concurrent users.
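One way to see how a 75% key-cache reduction could translate into roughly 60% more concurrent users is a back-of-the-envelope calculation. The sketch below is illustrative only and is not taken from the paper: the layer count, model width, context length, and 2-byte element size are hypothetical values, and it assumes that baseline keys and values are the same size and that serving capacity is limited purely by KV cache memory.

```python
def kv_cache_bytes(n_layers, n_tokens, key_dim, value_dim, bytes_per_elem=2):
    """Return (key_bytes, value_bytes) cached for one sequence.

    Assumes one key vector and one value vector per token per layer.
    """
    key_bytes = n_layers * n_tokens * key_dim * bytes_per_elem
    value_bytes = n_layers * n_tokens * value_dim * bytes_per_elem
    return key_bytes, value_bytes


# Hypothetical configuration, chosen only for illustration.
n_layers, d_model, n_tokens = 32, 4096, 4096

# Baseline: keys and values both stored at full width.
full_k, full_v = kv_cache_bytes(n_layers, n_tokens, d_model, d_model)

# Thin keys: keys stored at a quarter of the width, values unchanged.
thin_k, thin_v = kv_cache_bytes(n_layers, n_tokens, d_model // 4, d_model)

key_reduction = 1 - thin_k / full_k
total_reduction = 1 - (thin_k + thin_v) / (full_k + full_v)

# If memory per user shrinks by `total_reduction`, the same memory budget
# holds 1 / (1 - total_reduction) times as many users.
extra_users = 1 / (1 - total_reduction) - 1

print(f"key cache reduction:   {key_reduction:.1%}")
print(f"total KV reduction:    {total_reduction:.1%}")
print(f"extra concurrent users: {extra_users:.0%}")
```

Under these assumptions, shrinking only the keys to a quarter of their width removes 37.5% of the combined KV cache, which is where a figure near 60% more users per memory budget would come from. The absolute savings in gigabytes depend on context length and precision, which the summary does not specify.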
🏢 Perplexity