AI · Bullish · arXiv – CS AI · 16h ago · 7/10

Thin Keys, Full Values: Reducing KV Cache via Low-Dimensional Attention Selection

Researchers propose an asymmetric transformer attention scheme in which keys are projected into fewer dimensions than queries and values, achieving a 75% reduction in key cache size with minimal quality loss. For 7B-parameter models, the technique reportedly saves 25GB of KV cache per user, enabling roughly 60% more concurrent users.
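A minimal sketch of the idea described above, with hypothetical dimensions and random weights (the paper's actual architecture and sizes are not given in this summary). Keys are projected into a low dimension and cached there, queries are projected into the same low dimension so the dot product is defined, and values keep the full width:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes for illustration only.
d_model = 64   # hidden size
d_key   = 16   # reduced key dim: 16/64 gives a 75% smaller key cache
seq_len = 8

rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))

# Illustrative random projection weights.
W_q = rng.standard_normal((d_model, d_key)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_key)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

Q = x @ W_q  # (seq_len, d_key)
K = x @ W_k  # (seq_len, d_key) -- this is what gets cached, 4x smaller
V = x @ W_v  # (seq_len, d_model) -- values stay full-width

scores = softmax(Q @ K.T / np.sqrt(d_key))
out = scores @ V  # (seq_len, d_model)

key_cache_ratio = d_key / d_model  # 0.25 -> 75% less key-cache memory per token
```

Only the key cache shrinks here; the value cache is unchanged, which matches the "thin keys, full values" framing of the title.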

๐Ÿข Perplexity