AIBullish · arXiv CS AI · 16h ago · 7/10
🧠
Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection
Researchers propose an asymmetric transformer attention scheme in which keys (and the query projections scored against them) use fewer dimensions than values, cutting key-cache size by 75% with minimal quality loss. For 7B-parameter models this saves roughly 25GB of KV cache per user, enabling about 60% more concurrent users.
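A minimal single-head sketch of the idea (class name, dimensions, and cache handling are illustrative assumptions, not the paper's implementation): queries and keys are projected to a narrow d_k while values keep the full model width, so only the key half of the cache shrinks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ThinKeyAttention(nn.Module):
    """Single-head sketch: attention scoring happens in a narrow d_k space
    while values keep the full width, so the cached key tensor is
    d_k / d_model the size of a conventional one. Masking omitted for brevity."""

    def __init__(self, d_model: int = 512, d_k: int = 128):
        super().__init__()
        self.w_q = nn.Linear(d_model, d_k, bias=False)      # queries projected down to match thin keys
        self.w_k = nn.Linear(d_model, d_k, bias=False)      # thin keys: the part of the cache that shrinks
        self.w_v = nn.Linear(d_model, d_model, bias=False)  # full-width values, cached as usual
        self.w_o = nn.Linear(d_model, d_model, bias=False)
        self.scale = d_k ** -0.5

    def forward(self, x, kv_cache=None):
        q = self.w_q(x)  # (B, T, d_k)
        k = self.w_k(x)  # (B, T, d_k)      -- 4x smaller per token than v here
        v = self.w_v(x)  # (B, T, d_model)
        if kv_cache is not None:            # decode step: append to cached keys/values
            k = torch.cat([kv_cache[0], k], dim=1)
            v = torch.cat([kv_cache[1], v], dim=1)
        attn = F.softmax((q @ k.transpose(-2, -1)) * self.scale, dim=-1)
        return self.w_o(attn @ v), (k, v)
```

With d_k = d_model / 4 as above, each cached key entry takes a quarter of the usual space, matching the 75% key-cache figure; the value half of the cache is unchanged.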
🟢 Perplexity