Researchers propose a modification to log-linear attention mechanisms that learns adaptive memory decay parameters directly from input data rather than using fixed values. This approach maintains logarithmic memory growth and log-linear computational complexity while improving long-range context retention, particularly in language modeling and selective recall tasks.
This research addresses a fundamental architectural limitation in modern sequence models. Log-linear attention mechanisms represent a promising middle ground between transformers' expensive quadratic complexity and linear models' severe context compression, but their fixed decay parameters fail to adapt to varying input characteristics. The proposed solution leverages a lightweight two-layer MLP to generate per-token, per-level decay weights, enabling the model to dynamically adjust memory retention based on content rather than rigid positional hierarchies.
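A minimal PyTorch-style sketch of such a gating network, assuming a hidden width of 64, a SiLU activation, and roughly ceil(log2(T)) + 1 hierarchy levels; the class and parameter names are illustrative, not the authors' implementation:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecayMLP(nn.Module):
    """Hypothetical two-layer MLP mapping each token's hidden state to one
    positive decay weight per hierarchy level (names are illustrative)."""
    def __init__(self, d_model: int, num_levels: int, d_hidden: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, num_levels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = F.silu(self.fc1(x))
        # softplus keeps each level's weight positive without forcing
        # the levels to compete for a shared probability mass
        return F.softplus(self.fc2(h))  # (batch, seq_len, num_levels)

# Example: a 1024-token sequence with ceil(log2(1024)) + 1 = 11 levels
seq_len, d_model = 1024, 512
num_levels = math.ceil(math.log2(seq_len)) + 1
decay_mlp = DecayMLP(d_model, num_levels)
x = torch.randn(2, seq_len, d_model)
lam = decay_mlp(x)  # per-token, per-level decay weights
```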
The innovation builds on established attention theory and hierarchical memory structures. Prior work showed that compression-based approaches sacrifice recall capability, whereas the Fenwick tree hierarchy in log-linear attention offers an elegant middle ground at only logarithmic additional cost. However, uniform decay across hierarchy levels wastes modeling capacity when different contexts call for different memory retention patterns. Input-dependent decay removes this bottleneck by letting the model learn which information deserves preservation.
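A rough illustration of the Fenwick-tree style partition that log-linear attention builds on: the prefix before position t is split into O(log t) disjoint segments whose lengths follow the binary representation of t, each segment is summarized by one memory state, and the per-level decay weights scale those summaries. The helper below shows only the decomposition, not the attention computation itself.

```python
def fenwick_segments(t: int):
    """Split the prefix [0, t) into O(log t) disjoint segments whose lengths
    are the powers of two in the binary expansion of t (a standard
    Fenwick-tree prefix decomposition, shown only to illustrate where the
    per-level memories come from)."""
    segments, end = [], t
    while end > 0:
        length = end & (-end)                 # lowest set bit of `end`
        segments.append((end - length, end))  # half-open [start, end)
        end -= length
    return segments[::-1]

print(fenwick_segments(13))
# [(0, 8), (8, 12), (12, 13)] -> three segments, one per set bit of 13
```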
For machine learning practitioners, this advancement improves long-context performance without meaningful architectural changes. Using a softplus activation instead of softmax avoids inter-level competition, so each hierarchy level can scale its memory weight independently. Evaluation results show consistent improvements on associative recall and selective copying tasks, with particularly strong gains in regimes where fixed-decay baselines degrade. The negligible parameter overhead means integration into existing systems requires minimal changes.
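A small sketch contrasting the two activations on the same raw per-level scores: softmax ties the levels together (the weights must sum to one), while softplus maps each level to a positive value independently. This shows only the activation behavior, not the full gating pipeline.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, 0.5, -1.0, 3.0])  # raw per-level scores from the MLP

coupled = F.softmax(logits, dim=-1)   # levels compete: weights sum to 1,
                                      # so boosting one level suppresses the rest
independent = F.softplus(logits)      # each level scales on its own in (0, inf)

print(coupled.sum())   # tensor(1.0000)
print(independent)     # tensor([2.1269, 0.9741, 0.3133, 3.0486])
```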
Looking forward, this work opens investigation into adaptive mechanisms across other fixed-parameter designs in sequence modeling. Future research might explore learned decay patterns in other hierarchical attention variants or apply similar principles to state-space models. The approach suggests that many seemingly fundamental tradeoffs in efficient transformers may yield to lightweight learned modifications.
- Adaptive decay parameters learned via MLP improve log-linear attention while maintaining log-linear computational complexity.
- Input-dependent memory weights consistently outperform fixed decay baselines, especially on long-range tasks.
- Softplus activation enables independent per-level scaling without the inter-level competition that softmax introduces.
- The modification adds negligible parameter overhead compared to baseline log-linear attention implementations.
- Results validate that rigid architectural constraints in efficient attention can be relaxed through learned adaptations.