Keyless Attention: Value-Space Routing and Value-Only Caching for Efficient Transformers
Researchers propose Keyless Attention, a transformer mechanism that eliminates key projections to reduce KV cache memory by 50% while maintaining or improving performance across multiple model architectures. The approach introduces a value-space routing matrix that replaces the traditional key projection, demonstrating competitive results on perplexity and downstream benchmarks.
Keyless Attention addresses a critical bottleneck in transformer inference: key-value cache memory, which grows linearly with sequence length and batch size during decoding. By removing the key projection entirely and operating only on queries and values, the mechanism caches one tensor per token instead of two, a straightforward 50% reduction in KV cache size for any system deploying large language models. This is particularly significant for production environments where cache memory, rather than compute, often limits batch size and context length.
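The 50% figure follows from simple arithmetic, which can be sketched as below. This is a minimal illustration, not the authors' code: the function name `kv_cache_bytes`, the model dimensions, and the fp16 element size are all assumptions chosen for the example, and the calculation assumes keys and values have the same head dimension.

```python
def kv_cache_bytes(num_layers, num_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2, cached_tensors=2):
    """Bytes needed to cache per-token attention state during decoding.

    cached_tensors=2 models standard attention (keys and values);
    cached_tensors=1 models a value-only cache.
    bytes_per_elem=2 assumes fp16/bf16 storage.
    """
    per_token = num_layers * num_heads * head_dim * bytes_per_elem
    return cached_tensors * per_token * seq_len * batch_size


# Hypothetical model configuration, chosen only for illustration.
config = dict(num_layers=32, num_heads=32, head_dim=128,
              seq_len=4096, batch_size=1)

standard = kv_cache_bytes(**config, cached_tensors=2)
value_only = kv_cache_bytes(**config, cached_tensors=1)

print(f"standard K+V cache: {standard / 2**30:.2f} GiB")
print(f"value-only cache:   {value_only / 2**30:.2f} GiB")
print(f"reduction:          {1 - value_only / standard:.0%}")
```

Under these assumed dimensions the cache shrinks from 2 GiB to 1 GiB per sequence. Doubling `seq_len` or `batch_size` doubles both numbers, which is the linear growth described above.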
The innovation builds on established transformer theory by reframing attention as a factorization problem. Standard attention represents a depth-2 factorization of the attention bilinear form, while Keyless Attention enables depth-m variants. At m=3, the approach maintains computational parity with standard attention while introducing a coupling mechanism between value-space routing and retrieval. This framing presents the method as a principled generalization of standard attention rather than an ad hoc heuristic.
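One way to read the factorization framing is sketched below. The depth-2 line is the standard attention score; the depth-3 line is an assumed form consistent with this summary, not the paper's confirmed parameterization. The symbol $R$ for the value-space routing matrix, the placement of $R$ between the query and value projections, and the generic factors $A_k$ are all illustrative assumptions. The usual $1/\sqrt{d}$ scaling and the softmax are omitted.

```latex
% x_i, x_j: input vectors for query position i and context position j
% Depth-2 (standard attention): the bilinear form factors through W_Q and W_K
s_{ij} = x_i^{\top} \, W_Q W_K^{\top} \, x_j = q_i^{\top} k_j

% Depth-3 (assumed keyless form): a routing matrix R acts on cached values
s_{ij} = x_i^{\top} \, W_Q \, R \, W_V^{\top} \, x_j = q_i^{\top} R \, v_j

% Depth-m (general pattern): the bilinear form is a product of m factors
s_{ij} = x_i^{\top} \, A_1 A_2 \cdots A_m \, x_j
```

Under this reading, only $v_j = W_V^{\top} x_j$ needs to be cached per token, since the same vector serves both scoring (through $R$) and retrieval. That shared use is one interpretation of the coupling between routing and retrieval mentioned above.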
Experimental validation across five models and four architectures demonstrates practical viability. The method matches or exceeds standard attention's perplexity on four of the five models and shows stronger zero-shot reasoning performance on downstream benchmarks, which indicates that it captures meaningful semantic relationships despite the architectural simplification. A consistent 50% KV cache reduction with little or no loss in quality on most models represents a valuable efficiency gain for production deployments.
The research has immediate implications for edge deployment, multi-user serving on memory-constrained GPUs, and bandwidth-limited inference pipelines where cache reads contribute to latency. However, adoption requires integration into existing transformer implementations and retraining, which may limit near-term uptake despite the technical advantages.
- Keyless Attention achieves a 50% KV cache memory reduction by eliminating key projections entirely.
- The mechanism matches or outperforms standard QKV attention on perplexity for four of five tested models, with computational parity at m=3.
- It introduces a depth-m attention factorization framework that generalizes standard transformer attention.
- It delivers stronger zero-shot reasoning performance on downstream benchmarks while maintaining the efficiency gains.
- It offers direct practical benefits for memory-constrained inference deployments and multi-user serving scenarios.