When Do Attention Circuits Form? Developmental Trajectories of Capability and Attention-Sink Emergence Across Three 1B-Class Architectures
Researchers tracked how attention-head circuits form during training across three 1B-parameter language models and found that induction circuits and attention-sink circuits emerge as distinct phenomena, roughly an order of magnitude apart in training tokens. The study also identifies an architectural property (no BOS-attending heads in the earliest layers) and shows that circuit identification requires only 0.3-2% of the total training tokens, which makes this kind of mechanistic interpretability analysis of transformer models considerably cheaper.
This mechanistic interpretability study advances understanding of how transformer circuits develop during pretraining, moving beyond treating capability emergence as a single unified phase transition. The researchers' systematic tracking across multiple architectures and datasets reveals that attention-sink formation and capability-circuit formation follow distinct developmental timelines, with induction heads emerging substantially before BOS-attractor heads stabilize in DCLM-trained models. This separation challenges simplified phase-transition narratives and suggests language model development involves multiple, asynchronous capability acquisitions.
The work builds on established mechanistic interpretability frameworks but extends them to developmental questions: when and how do specific circuit types crystallize during training? The discovery that certain architectural properties, such as the zero-BOS floor in layers L0 and L1, are hard constraints rather than learned behaviors has implications for model design. Because circuits can be identified from early-training checkpoints alone (the first 0.3-2% of training tokens), circuit discovery no longer requires analyzing a full training run, which lowers the computational cost of mechanistic interpretability research and makes it accessible to more groups.
For AI developers and researchers, these results provide actionable insights for model architecture design and training optimization. Understanding that circuits emerge at different phases allows for targeted interventions during pretraining to potentially influence capability development. The reproducibility across different architectures and training corpora suggests these patterns represent fundamental properties of transformer learning dynamics rather than dataset-specific artifacts. Future work might exploit this timeline separation to understand causal relationships between circuits and to develop more interpretable training procedures.
- →Induction and attention-sink circuits form in distinct transitions 10-20x apart in training tokens, not in a single phase transition
- →Architectural properties like zero-BOS heads in early layers are structural constraints, not learned features
- →Circuit identification stabilizes within just 0.3-2% of total training tokens, enabling efficient mechanistic analysis
- →BOS-attractor emergence takes different shapes across models: gradual ramps in Pythia and OLMoE, but a sharp phase transition in OLMo
- →Elevated participation-ratio spectral signals predict induction head formation before capability thresholds are crossed
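
The participation ratio mentioned in the last takeaway is a standard measure of how many directions carry meaningful variance: PR = (Σλ)² / Σλ², where λ are the eigenvalues of a covariance matrix. The sketch below computes it for a small set of vectors using the identities Σλ = tr(C) and Σλ² = tr(C²), so no eigendecomposition is needed. Which matrices the study applies it to is not stated in this summary; treating the input as a set of activation vectors is an assumption, and the function name is hypothetical.

```python
# Minimal sketch under stated assumptions; pure Python, no external dependencies.

def participation_ratio(rows):
    """Participation ratio of the sample covariance of `rows`.

    `rows` is a list of equal-length vectors (assumed here to be
    activation vectors). Returns a value between 1 (variance in a single
    direction) and the vector dimension (variance spread evenly).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix C (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    trace = sum(cov[i][i] for i in range(d))  # sum of eigenvalues
    # For symmetric C, tr(C^2) is the sum of squared entries,
    # which equals the sum of squared eigenvalues.
    trace_sq = sum(cov[i][j] ** 2 for i in range(d) for j in range(d))
    if trace_sq == 0:
        return 0.0  # degenerate case: no variance at all
    return trace * trace / trace_sq


# Variance confined to one direction versus spread evenly over two.
pr_rank_one = participation_ratio([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
pr_isotropic = participation_ratio([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
```

A higher value means variance is spread across more dimensions, so an "elevated" participation ratio would correspond to a less concentrated spectrum at that point in training.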