Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders
Researchers investigate feature stability in sparse autoencoders (SAEs) and find that features which fail to reproduce across training runs concentrate in reproducible lower-rank subspaces rather than being pure noise. Stable features carry most of the functional signal for reconstruction and prediction, while unstable features have minimal individual impact but reflect shared geometric structure that different seeds resolve differently.
This research addresses a fundamental challenge in neural network interpretability: whether features learned by sparse autoencoders remain consistent across independent training runs. The study indicates that seed dependence in SAEs is structural rather than incidental. Unstable features are not failed or corrupted latents; they represent alternative bases for encoding the same underlying low-dimensional subspace of the activations.
The findings emerge from extensive empirical analysis across multiple models, layers, and dictionary sizes, combined with controlled synthetic experiments that isolate the mechanism. The researchers demonstrate that unstable features, while individually non-reproducible, cluster in reproducible subspaces. This suggests that SAEs encounter basis ambiguity: there are multiple valid ways to decompose the same region of activation space. The distinction matters because it reframes the stability problem: the issue is not that SAEs fail to capture meaningful structure, but that the same structure admits multiple valid SAE representations.
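To make the difference between feature-level and subspace-level stability concrete, here is a minimal synthetic sketch. It is not the study's code or metric: the toy decoders, the helper names `max_cosine_match` and `subspace_overlap`, and the choice of mean squared principal-angle cosine as an overlap score are all assumptions made for illustration. Two hypothetical seeds share some directions exactly, while their remaining directions are different random vectors drawn from the same low-rank subspace.

```python
import numpy as np

def unit_rows(w):
    """Normalize each row to unit length."""
    return w / np.linalg.norm(w, axis=1, keepdims=True)

def max_cosine_match(dec_a, dec_b):
    """For each row of dec_a, the highest cosine similarity with any row of dec_b."""
    return (unit_rows(dec_a) @ unit_rows(dec_b).T).max(axis=1)

def subspace_overlap(feats_a, feats_b, rank):
    """Mean squared cosine of the principal angles between the top-`rank`
    row spaces of two feature sets (1.0 = same subspace, ~rank/d = unrelated)."""
    _, _, vt_a = np.linalg.svd(feats_a, full_matrices=False)
    _, _, vt_b = np.linalg.svd(feats_b, full_matrices=False)
    cosines = np.linalg.svd(vt_a[:rank] @ vt_b[:rank].T, compute_uv=False)
    return float(np.mean(cosines ** 2))

# Toy setup (all sizes are arbitrary assumptions, not values from the study).
rng = np.random.default_rng(0)
d, n_stable, n_unstable, k = 64, 20, 12, 6

stable = rng.normal(size=(n_stable, d))              # directions both seeds recover
basis = np.linalg.qr(rng.normal(size=(d, k)))[0].T   # shared k-dim subspace, shape (k, d)

# Each seed picks its own directions inside the shared subspace.
unstable_a = rng.normal(size=(n_unstable, k)) @ basis
unstable_b = rng.normal(size=(n_unstable, k)) @ basis

dec_a = np.vstack([stable, unstable_a])
dec_b = np.vstack([stable + 0.01 * rng.normal(size=stable.shape), unstable_b])

# Feature-level view: do individual directions reappear in the other seed?
stability = max_cosine_match(dec_a, dec_b)
print("stable features, min match:   ", round(float(stability[:n_stable].min()), 3))
print("unstable features, mean match:", round(float(stability[n_stable:].mean()), 3))

# Subspace-level view: do the unstable features span the same space?
shared_overlap = subspace_overlap(unstable_a, unstable_b, rank=k)
random_overlap = subspace_overlap(unstable_a, rng.normal(size=(n_unstable, d)), rank=k)
print("unstable-vs-unstable overlap: ", round(shared_overlap, 3))
print("unstable-vs-random overlap:   ", round(random_overlap, 3))
```

In this toy setting the ground-truth split between stable and unstable rows is known; with real SAEs one would first have to choose a matching threshold to decide which features count as unstable, and that choice is itself a judgment call.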
For practitioners building interpretability tools, these results suggest that feature stability should be assessed at the subspace level, and that instability of an individual latent should not by itself be treated as evidence of failure. The ability to construct more stable SAEs by pooling cross-seed features while preserving explained variance offers a practical remedy. However, this also complicates the interpretability narrative: if multiple SAE bases represent the same underlying structure equally well, which one correctly explains model behavior? The research implies that functional impact, measured through reconstruction and prediction metrics, is a more reliable guide than surface-level feature stability.
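One way to picture pooling cross-seed features is as greedy deduplication of decoder directions. The sketch below is an assumption about what such a step could look like, not the procedure used in the study: the function name `pool_unique_features`, the 0.9 cosine threshold, and the toy decoders are all hypothetical. It also does not check explained variance, which would require the model's real activations.

```python
import numpy as np

def pool_unique_features(decoders, threshold=0.9):
    """Greedily pool unit-normalized decoder rows from several seeds,
    skipping any row whose cosine with an already kept row reaches `threshold`."""
    pooled = []
    for dec in decoders:
        for row in dec / np.linalg.norm(dec, axis=1, keepdims=True):
            if all(float(row @ kept) < threshold for kept in pooled):
                pooled.append(row)
    return np.stack(pooled)

# Toy decoders (hypothetical): 10 shared directions plus seed-specific extras.
rng = np.random.default_rng(0)
shared = rng.normal(size=(10, 32))
seed_a = np.vstack([shared, rng.normal(size=(3, 32))])
seed_b = np.vstack([shared + 0.01 * rng.normal(size=shared.shape),
                    rng.normal(size=(5, 32))])

pooled = pool_unique_features([seed_a, seed_b])
print(pooled.shape)
```

Here the shared directions are kept once and each seed's extra directions are added on top. The threshold controls how aggressively near-duplicates are merged, so it would need tuning against reconstruction quality on real data.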
- Unstable SAE features concentrate in reproducible lower-rank subspaces, indicating basis ambiguity rather than pure noise
- Stable features dominate reconstruction and prediction tasks, while unstable features have minimal marginal functional impact
- Seed dependence reflects the inherent non-uniqueness of sparse decompositions within shared regions of activation space
- Pooling unique cross-seed features enables construction of more stable SAEs without sacrificing explained variance
- Functional impact metrics provide a more reliable basis for evaluating features than the stability of individual features across training runs