Exact Linear Attention
Researchers introduce Exact Linear Attention (ELA), a novel Transformer mechanism that achieves linear computational complexity while eliminating approximation errors in attention calculations. The approach demonstrates significant practical improvements including 6x faster decoding speeds and 75% reduction in KV cache memory, with extensions to vision models showing 4.3x GPU speedup.