Where does Absolute Position come from in decoder-only Transformers?
Researchers discovered that RoPE-trained transformer models encode absolute position information despite RoPE only encoding relative offsets, with the leakage originating from causal masking and residual stream components. The findings reveal how different architectural variants—NTK scaling, sliding-window attention, and standard RoPE—balance these position-encoding mechanisms differently, with attention sinks serving as token-anchored stabilizers.
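The claim that RoPE encodes only relative offsets can be checked on a toy example. The sketch below is a minimal pure-Python illustration, not the researchers' code: the helper names `rope` and `dot`, the 4-dimensional vectors, and the base of 10000 are all assumptions chosen for illustration. It rotates a query and a key by their positions and shows that the attention score is unchanged when both positions shift by the same amount.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles (toy RoPE)."""
    d = len(vec)
    out = []
    for k in range(0, d, 2):
        theta = pos * base ** (-k / d)   # rotation angle for the pair starting at index k
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors, chosen arbitrarily.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

score_a = dot(rope(q, 5), rope(k, 2))        # query at 5, key at 2: offset 3
score_b = dot(rope(q, 105), rope(k, 102))    # both shifted by 100: offset still 3

print(abs(score_a - score_b) < 1e-9)         # the two scores agree up to rounding
```

Because the score depends only on the offset between the two positions, any absolute-position signal a trained model carries has to enter through some other route, which is the question the research takes up.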
The research addresses a basic question: how do decoder-only transformers keep track of position in their attention mechanisms? Absolute position information leaks into these models even though RoPE is designed to expose only relative offsets, which matters both for understanding transformer behavior and for improving model efficiency.
The research traces position encoding through two distinct pathways: the causal mask creates position-dependent softmax denominators by construction, while the residual stream preserves information from earlier tokens through a closed dynamical system at position zero. Attention sinks—specialized heads that read this trajectory—function as deterministic fingerprints, remaining constant when using auto-prepended beginning-of-sequence tokens but varying otherwise. This architectural insight explains why different scaling approaches yield varying results.
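The first pathway, position-dependent softmax denominators, can be seen in a toy setting. The following is a minimal sketch under stated assumptions rather than the researchers' setup: the function name `causal_attention_weights`, the sequence length of 6, and the all-zero scores are hypothetical choices that remove every content and position signal from the logits. It covers only the causal-mask pathway and does not model the residual stream.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax under a causal mask: query i sees keys 0..i only."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]                 # causal mask: drop keys j > i
        m = max(visible)                       # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        denom = sum(exps)                      # the denominator sums i + 1 terms
        weights.append([e / denom for e in exps] + [0.0] * (len(row) - i - 1))
    return weights

n = 6
# Position-free toy scores: every query-key pair gets the same logit.
uniform_scores = [[0.0] * n for _ in range(n)]
weights = causal_attention_weights(uniform_scores)

# Weight each query position places on the first token.
first_token_weight = [row[0] for row in weights]
for i, w in enumerate(first_token_weight):
    print(f"query {i}: weight on token 0 = {w:.3f}")
```

Even with identical logits everywhere, query position i places weight 1/(i+1) on the first token (1.000, 0.500, 0.333, and so on), so the attention output already varies with absolute position. In this toy case the mask alone is enough to produce the signal.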
For the AI development community, these findings clarify how current large language models handle position at a mechanistic level. Developers and researchers can use this knowledge to optimize model architectures, potentially reducing computational overhead while maintaining performance. The contrast between NTK scaling, which suppresses residual-stream components, and sliding-window attention, which allows them to accumulate, offers pathways for targeted improvements.
The research becomes particularly relevant as the field pursues more efficient transformers for deployment. Understanding these position-encoding mechanisms lets researchers design architectures with intentional trade-offs between computational cost and capability. Future work will likely focus on exploiting these mechanisms for more efficient long-context models and on investigating whether similar leakage patterns occur in encoder-decoder architectures.
- →Absolute position information leaks into RoPE-trained transformers through causal masking and residual streams, even though RoPE encodes only relative offsets
- →Attention sinks function as token-anchored stabilizers that preserve deterministic fingerprints of position-zero tokens across the model
- →Different architectural variants balance position-encoding components differently, with NTK scaling suppressing and sliding-window attention amplifying residual-stream effects
- →Replacing BOS embeddings removes approximately 40% of the residual-stream position component at early queries
- →Mechanistic understanding of position encoding enables optimization opportunities for more efficient transformer architectures