AI · Neutral · arXiv – CS AI · 4h ago · 6/10
Why Do Accumulated Transformations Extrapolate?
Researchers show that accumulating data-dependent transformations in transformer attention extrapolates to longer contexts better than fixed position encodings such as RoPE, although performance still degrades at extreme context lengths. The gain comes from learned, token-dependent rotations, which create a finite mixing window: contributions from distant tokens are suppressed while near-range signals are preserved. The principle applies to orthogonal transformations in general rather than to any one specific technique.
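
The finite-window effect can be illustrated with a toy NumPy sketch. This is not the paper's method: the per-token angles are drawn from a Gaussian as a stand-in for learned, token-dependent rotations, and the function names, `sigma`, and the channel count are all hypothetical. It compares the channel-averaged rotation alignment between a query and a key under accumulated per-token angles with the same quantity under fixed RoPE-style frequencies.

```python
import numpy as np

def accumulated_rotation_kernel(angles):
    """angles: (seq_len, n_channels) per-token rotation angles (hypothetical stand-in).
    Entry [i, j] is the mean over channels of cos(Phi_i - Phi_j), where Phi_t is the
    running sum of angles up to token t, so the relative rotation between two
    positions is the sum of the angles of the tokens between them."""
    cum = np.cumsum(angles, axis=0)               # Phi_t for every token and channel
    rel = cum[:, None, :] - cum[None, :, :]       # (seq_len, seq_len, n_channels)
    return np.cos(rel).mean(axis=-1)

def rope_kernel(seq_len, freqs):
    """Same quantity for fixed, position-only rotations (RoPE-style):
    the relative angle is freq * (i - j), independent of token content."""
    phase = np.arange(seq_len)[:, None] * freqs[None, :]
    rel = phase[:, None, :] - phase[None, :, :]
    return np.cos(rel).mean(axis=-1)

def mean_at_distance(kernel, d):
    """Average kernel value over all (query, key) pairs that are d tokens apart."""
    return float(np.diagonal(kernel, offset=-d).mean())

rng = np.random.default_rng(0)
seq_len, n_channels, sigma = 128, 256, 0.3        # assumed toy sizes

# Assumption: random angles play the role of learned, token-dependent ones.
angles = rng.normal(0.0, sigma, size=(seq_len, n_channels))
acc = accumulated_rotation_kernel(angles)

# Assumption: a geometric frequency schedule in the style commonly used with RoPE.
freqs = 10000.0 ** (-np.arange(n_channels) / n_channels)
rope = rope_kernel(seq_len, freqs)

for d in (1, 8, 32, 100):
    print(d, round(mean_at_distance(acc, d), 3), round(mean_at_distance(rope, d), 3))
```

Under this Gaussian assumption the accumulated angle between two tokens behaves like a random walk, so the expected alignment at distance d is exp(-sigma² · d / 2): close to 1 for nearby tokens and close to 0 for distant ones, with `sigma` setting the width of the window. The fixed-frequency kernel depends only on the offset i - j and has no content-dependent decay of this kind.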