Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining
Researchers have developed a method that uses sparse crosscoders to track how large language models learn linguistic concepts over the course of pretraining, and introduces a new metric, Relative Indirect Effects (RelIE), to identify when specific features become causally important. The approach provides interpretable, fine-grained visibility into representation learning throughout training, advancing understanding of how LLMs acquire abstract capabilities.
This research addresses a fundamental gap in AI interpretability: understanding not just what LLMs can do, but how they acquire specific linguistic capabilities during training. Traditional benchmarking reveals performance metrics but obscures the underlying mechanisms of feature emergence. By deploying sparse crosscoders across model checkpoints, researchers can now map the temporal evolution of linguistic features and identify critical training phases where abstract concepts crystallize into causal importance for task performance.
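To make the setup concrete, here is a minimal sketch of a checkpoint crosscoder in PyTorch. It is illustrative only: the class name, dimensions, and loss weighting are assumptions rather than the authors' implementation. The defining property is a single latent dictionary shared across checkpoints, with per-checkpoint encoder and decoder weights, so that one latent index can be followed through training time.

```python
import torch
import torch.nn as nn

class CheckpointCrosscoder(nn.Module):
    """Sparse crosscoder over T pretraining checkpoints (illustrative sketch).

    Every checkpoint has its own encoder/decoder weights, but all checkpoints
    share one latent dictionary, so a single latent index can be traced
    across training time.
    """

    def __init__(self, d_model: int, n_latents: int, n_checkpoints: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(n_checkpoints, d_model, n_latents) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_checkpoints, n_latents, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(n_latents))
        self.b_dec = nn.Parameter(torch.zeros(n_checkpoints, d_model))

    def forward(self, acts: torch.Tensor):
        # acts: (batch, n_checkpoints, d_model) -- the same inputs are run
        # through every checkpoint, activations collected at a fixed layer.
        # Shared latents: sum encoder contributions over all checkpoints.
        pre = torch.einsum("btd,tdl->bl", acts, self.W_enc) + self.b_enc
        z = torch.relu(pre)  # sparse latent activations
        # Per-checkpoint reconstructions from the shared latent code.
        recon = torch.einsum("bl,tld->btd", z, self.W_dec) + self.b_dec
        return recon, z

def crosscoder_loss(acts, recon, z, l1_coef=1e-3):
    """Reconstruction error summed over checkpoints, plus an L1 sparsity
    penalty on the shared latents (coefficient is an assumed placeholder)."""
    mse = (recon - acts).pow(2).sum(dim=-1).mean()
    l1 = z.abs().sum(dim=-1).mean()
    return mse + l1_coef * l1
```

The design choice that matters for this line of work is the shared dictionary: because latent *i* means the same thing at every checkpoint, its per-checkpoint decoder norms and causal effects can be plotted as a trajectory through training.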
The work builds on growing momentum in mechanistic interpretability, where researchers increasingly focus on decomposing neural network behavior into interpretable components. Prior efforts like sparse autoencoders and activation patching established that trained models contain recoverable, meaningful features. This research extends that framework across time, treating pretraining as a staged process of feature discovery, consolidation, and sometimes discontinuation.
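Activation patching, mentioned above, is the causal tool that indirect-effect metrics build on: replace one component's activation on a clean run with an intervened value and measure how a task metric changes. The helper below is a generic PyTorch sketch with hypothetical names (`indirect_effect`, `metric`), not a specific library API.

```python
import torch

def indirect_effect(model, layer, clean_batch, patched_acts, metric):
    """Generic activation-patching sketch (hypothetical helper, not a library API).

    Runs `model` on `clean_batch`, but overwrites the output of `layer`
    with `patched_acts` (e.g., activations with one crosscoder latent
    ablated), and returns the change in the task metric.
    """
    with torch.no_grad():
        baseline = metric(model(clean_batch))

    def patch_hook(module, inputs, output):
        # Returning a value from a forward hook replaces the module output;
        # `patched_acts` must match the layer's output shape.
        return patched_acts

    handle = layer.register_forward_hook(patch_hook)
    try:
        with torch.no_grad():
            patched = metric(model(clean_batch))
    finally:
        handle.remove()

    return patched - baseline  # the component's indirect effect on the metric
```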
For AI developers and safety researchers, interpretable pretraining dynamics carry substantial implications. Understanding when models acquire specific capabilities enables more targeted evaluation and potentially earlier detection of emerging behaviors, whether desired capabilities or problematic failure modes. Because the approach is architecture-agnostic and scalable, it could apply across different LLM families and sizes.
Looking ahead, this technique could inform more deliberate training procedures, where practitioners understand the learning trajectory of specific linguistic or behavioral features. Integration with other interpretability methods might yield even richer models of representation learning. Success here could accelerate the transition from post-hoc analysis of trained models to principled design of training processes optimized for both capability and interpretability.
- Sparse crosscoders enable tracking of linguistic feature emergence across model training checkpoints, providing temporal visibility into representation learning.
- The Relative Indirect Effects (RelIE) metric identifies when individual features become causally important for task performance during pretraining (one possible formalization is sketched after this list).
- The method detects feature emergence, maintenance, and discontinuation, mapping the complete lifecycle of learned concepts.
- The architecture-agnostic approach scales across different LLM families, advancing practical interpretability analysis.
- Understanding training dynamics could enable earlier detection of emerging capabilities and inform more deliberate model development practices.
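The paper's exact RelIE formula is not reproduced here, but one plausible reading of "relative indirect effects" is a per-checkpoint normalization: measure a latent's indirect effect at each checkpoint (for example, via the ablation patching sketched earlier), then divide by the total across checkpoints so the resulting curve shows where causal importance concentrates. The sketch below assumes that formulation; treat it as a hedged reconstruction, not the authors' definition.

```python
import numpy as np

def relie(ie_per_checkpoint, eps=1e-9):
    """Relative Indirect Effect curve for one latent (assumed formulation).

    ie_per_checkpoint: one |IE| value per pretraining checkpoint, e.g. the
    drop in task performance when the latent is ablated at that checkpoint.
    Returns the share of total causal effect attributed to each checkpoint.
    """
    ie = np.abs(np.asarray(ie_per_checkpoint, dtype=float))
    return ie / (ie.sum() + eps)

# Example: a latent that becomes causally important late in pretraining.
ie = [0.00, 0.01, 0.02, 0.15, 0.40, 0.42]
curve = relie(ie)
# First checkpoint by which half of the total causal effect has accumulated,
# a crude marker of when the feature "emerges" as causally relevant.
emergence_step = int(np.argmax(np.cumsum(curve) >= 0.5))
print(curve.round(3), emergence_step)
```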