Researchers propose Kaczmarz Linear Attention (KLA), an improved algorithm for long-context language modeling that replaces empirically learned update coefficients with mathematically derived, key-norm-normalized step sizes. KLA outperforms existing linear attention baselines such as Gated DeltaNet while maintaining computational efficiency and enabling stable processing of contexts up to 65K tokens.
The article addresses a fundamental computational bottleneck in modern AI: Transformer attention's quadratic scaling makes processing long contexts prohibitively expensive. Linear recurrent models offer an elegant alternative by maintaining a compressed, fixed-size state, but designing how that state updates (what to forget, what to write, and what to edit) remains a critical open problem. Previous approaches such as Gated DeltaNet (GDN) balanced these operations with learned coefficients, treating the problem empirically rather than grounding it theoretically.
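To make the forget/write/edit trade-off concrete, here is a minimal sketch of a gated delta-rule state update in the spirit of Gated DeltaNet; the function name, shapes, and toy values are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def gated_delta_step(S, k, v, alpha, beta):
    """One recurrent step over a fixed-size state S of shape (d_k, d_v).

    alpha: scalar forget/decay gate in [0, 1] (learned in prior models)
    beta:  scalar write/edit strength (also learned in prior work)
    The state is decayed, then corrected toward the new key/value pair
    with a rank-1 delta-rule update.
    """
    prediction = S.T @ k               # what the current state recalls for key k
    error = v - prediction             # residual the write should correct
    return alpha * S + beta * np.outer(k, error)

# Toy usage with d_k = d_v = 4 and a random key/value pair.
rng = np.random.default_rng(0)
S = np.zeros((4, 4))
k, v = rng.standard_normal(4), rng.standard_normal(4)
S = gated_delta_step(S, k, v, alpha=0.95, beta=0.5)
```

Here alpha controls forgetting and beta controls how strongly the state is edited toward the new association; in prior work both are produced by learned components.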
KLA represents a principled advance: it derives its update coefficient directly from the Kaczmarz projection method used in online regression. This theoretical grounding replaces guesswork with a closed-form step size, beta_t = eta_t / (||k_t||_2^2 + epsilon). The elegance lies in its simplicity: a single-scalar modification that preserves the existing architecture, making integration straightforward. Empirical validation demonstrates meaningful improvements: 5.1% lower perplexity than GDN at the 0.4B-parameter scale, perfect accuracy on retrieval tasks, and 7-point improvements on multi-query recall. Notably, KLA achieves 2.1x higher decode throughput, directly impacting inference costs and user experience.
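As a sketch of how the derived coefficient slots into such an update, the snippet below computes the step size from the quoted formula, beta_t = eta_t / (||k_t||_2^2 + epsilon), and applies it in the same toy delta-rule update; the names and surrounding structure are assumptions for illustration, not the paper's code.

```python
import numpy as np

def kaczmarz_beta(k, eta=1.0, eps=1e-6):
    """Step size quoted in the article: beta_t = eta_t / (||k_t||_2^2 + epsilon)."""
    return eta / (np.dot(k, k) + eps)

def kla_step(S, k, v, alpha=1.0, eta=1.0):
    # Same rank-1 delta-rule correction as before, but with the
    # Kaczmarz-derived step size instead of a learned scalar.
    beta = kaczmarz_beta(k, eta)
    error = v - S.T @ k
    return alpha * S + beta * np.outer(k, error)
```

With eta_t = 1 and no decay (alpha = 1), this choice makes the correction an exact Kaczmarz projection up to the epsilon regularizer: after the update, S^T k_t reproduces v_t, which is the online-regression view the derivation comes from.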
For the AI infrastructure industry, this work illustrates how theoretical rigor yields practical gains without architectural overhauls or hardware changes. The stability up to 65K tokens suggests KLA could enable longer context windows—valuable for document processing, code analysis, and reasoning tasks. The research provides a roadmap for improving other learnable coefficients in sequence models by grounding them in mathematical principles rather than empirical tuning.
- KLA achieves 8.09 validation perplexity versus 8.50 for GDN, demonstrating measurable improvements in language modeling accuracy
- A theoretically derived step-size formula replaces learned coefficients, improving performance without changing model architecture or computational requirements
- Decoding throughput improves 2.1x at 32K context length, reducing inference costs for deployed models
- Perfect performance on needle-in-a-haystack retrieval and 7-point gains on multi-query recall indicate stronger long-context reasoning capabilities
- The approach scales efficiently to 65K tokens while remaining compatible with existing hardware kernels and parallel algorithms