AIBullish · arXiv - CS AI · 5d ago
🧠
Long-Context Generalization with Sparse Attention
Researchers introduce ASEntmax, a new attention mechanism for transformer models that combines sparse attention with a learnable temperature parameter, allowing attention heads to assign exactly zero weight to irrelevant tokens. This approach significantly outperforms standard softmax attention on length generalization, extrapolating to sequences up to 1000x longer than those seen in training on synthetic tasks and improving long-context performance in language modeling.
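For intuition, here is a minimal sketch of the core idea, not the paper's actual implementation: it uses sparsemax (the alpha=2 special case of entmax) in place of the full ASEntmax parameterization, and a single learnable temperature that controls how sparse the attention distribution becomes. All class and variable names below are illustrative.

```python
import torch
import torch.nn as nn

def sparsemax(z: torch.Tensor, dim: int = -1) -> torch.Tensor:
    """Sparsemax (the alpha=2 case of entmax): Euclidean projection of the
    scores onto the probability simplex, yielding exact zeros for low scores."""
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    shape = [1] * z.dim()
    shape[dim] = -1
    k = torch.arange(1, z.size(dim) + 1, device=z.device, dtype=z.dtype).view(shape)
    z_cumsum = z_sorted.cumsum(dim)
    support = 1 + k * z_sorted > z_cumsum            # entries kept in the support
    k_support = support.sum(dim=dim, keepdim=True)   # support size per row
    tau_sum = z_cumsum.gather(dim, k_support - 1)    # cumulative score over the support
    tau = (tau_sum - 1) / k_support.to(z.dtype)      # threshold subtracted from scores
    return torch.clamp(z - tau, min=0)

class SparseAttentionHead(nn.Module):
    """Single attention head with sparse normalization and a learnable
    temperature; a simplified stand-in for ASEntmax, whose exact
    parameterization is described in the paper."""
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.log_temp = nn.Parameter(torch.zeros(1))  # learnable temperature (log-space)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
        # Lower temperature -> peakier scores -> sparser attention pattern,
        # which is what lets the model ignore distractors at long range.
        weights = sparsemax(scores / self.log_temp.exp(), dim=-1)
        return weights @ v

x = torch.randn(2, 16, 64)        # (batch, sequence, d_model)
out = SparseAttentionHead(64)(x)  # -> (2, 16, 64)
```

Unlike softmax, which always spreads some probability mass over every position, sparsemax can zero out most of the context entirely, and the learnable temperature lets each head tune how aggressive that pruning is.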