
Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

arXiv – CS AI | Gleb Gerasimov, Timofei Rusalev, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin, Daniil Gavrilov
🤖AI Summary

Researchers investigate feature stability in sparse autoencoders (SAEs) and find that features which fail to reproduce across training runs concentrate in reproducible lower-rank subspaces rather than representing pure noise. Stable features carry most of the functional signal for reconstruction and prediction. Unstable features have minimal individual impact, but they reflect shared geometric structure that different seeds resolve differently.

Analysis

This research addresses a fundamental challenge in neural network interpretability: whether the features learned by sparse autoencoders remain consistent across independent training runs. The study reveals that seed dependence in SAEs reflects a deeper structural phenomenon than previously understood. Unstable features are not failed or corrupted latents; they are alternative bases for encoding the same underlying low-dimensional subspace of information.

The findings come from extensive empirical analysis across multiple models, layers, and dictionary sizes, combined with controlled synthetic experiments that isolate the mechanism. The researchers demonstrate that unstable features, while individually non-reproducible, cluster in reproducible subspaces. This suggests that SAEs face basis ambiguity: there are multiple valid ways to decompose the same activation space. The distinction matters because it reframes the stability problem. The issue is not that SAEs fail to capture meaningful structure, but that the same structure admits multiple valid SAE representations.
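The difference between feature-level and subspace-level reproducibility can be illustrated with a toy example. The sketch below is not the paper's method: the helper names (`best_match`, `subspace_overlap`), the 0.9 cosine threshold, and the hand-built four-dimensional "decoders" are all assumptions chosen for illustration. Two hypothetical runs share one identical direction, and each spans the same plane with a different pair of directions.

```python
import numpy as np

def unit_rows(m):
    """Scale each row (one decoder direction per row) to unit length."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def best_match(dec_a, dec_b):
    """Max cosine similarity of each row of dec_a against all rows of dec_b."""
    return (unit_rows(dec_a) @ unit_rows(dec_b).T).max(axis=1)

def subspace_overlap(dirs_a, dirs_b, rank):
    """Mean squared cosine of the principal angles between the top-`rank`
    subspaces spanned by two sets of directions (1.0 means identical)."""
    # Right singular vectors give an orthonormal basis for each row space.
    _, _, vt_a = np.linalg.svd(dirs_a, full_matrices=False)
    _, _, vt_b = np.linalg.svd(dirs_b, full_matrices=False)
    cosines = np.linalg.svd(vt_a[:rank] @ vt_b[:rank].T, compute_uv=False)
    return float(np.mean(cosines ** 2))

s = np.sqrt(0.5)
# Hypothetical decoders from two seeds: row 0 is shared, rows 1-2 span the
# same plane but in bases rotated by 45 degrees relative to each other.
run_a = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=float)
run_b = np.array([[1, 0, 0, 0], [0, s, s, 0], [0, s, -s, 0]], dtype=float)

match_a = best_match(run_a, run_b)
match_b = best_match(run_b, run_a)

# Features without a close partner in the other run count as unstable
# (the 0.9 threshold is an arbitrary illustrative choice).
unstable_a = run_a[match_a < 0.9]
unstable_b = run_b[match_b < 0.9]
overlap = subspace_overlap(unstable_a, unstable_b, rank=2)

print("best matches for run A:", match_a.round(3))
print("overlap of unstable subspaces:", round(overlap, 6))
```

In this toy case the shared direction matches perfectly, the other two directions reach a best cosine of only about 0.71, and yet the subspaces spanned by the unmatched directions coincide. That is the pattern described above: individually non-reproducible features inside a reproducible subspace.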

For practitioners building interpretability tools, these results suggest assessing feature stability at the subspace level rather than treating the instability of individual latents as evidence of failure. The ability to construct more stable SAEs by pooling cross-seed features while preserving explained variance offers a practical solution. However, this also complicates the interpretability narrative: if multiple SAE bases represent the same underlying structure equally well, which one correctly explains model behavior? The research implies that functional impact, measured through reconstruction and prediction metrics, is a more reliable guide than surface-level feature stability.
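One simple way to picture pooling cross-seed features is a greedy deduplication of decoder directions. The summary does not describe the paper's actual pooling procedure, so the sketch below is only an assumption: `pool_unique_directions` and its 0.9 threshold are hypothetical, and the code neither retrains an SAE nor measures explained variance.

```python
import numpy as np

def pool_unique_directions(decoders, threshold=0.9):
    """Greedily merge decoder directions from several runs, skipping any
    direction whose cosine similarity to an already kept one reaches
    `threshold` (treated here as a duplicate)."""
    kept = []
    for dec in decoders:
        dec = dec / np.linalg.norm(dec, axis=1, keepdims=True)
        for direction in dec:
            if all(float(direction @ k) < threshold for k in kept):
                kept.append(direction)
    return np.stack(kept)

s = np.sqrt(0.5)
# Hypothetical decoders from two seeds (one direction per row).
run_a = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]], dtype=float)
run_b = np.array([[1, 0, 0, 0], [0, s, s, 0], [0, s, -s, 0]], dtype=float)

pooled = pool_unique_directions([run_a, run_b])
print(pooled.shape)
```

Here the direction shared by both runs is kept once, while the four seed-specific directions all survive, so the pooled dictionary has five rows. The result depends on the order in which runs are visited, which is one reason to treat this as a sketch rather than a recipe.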

Key Takeaways
  • Unstable SAE features concentrate in reproducible lower-rank subspaces, indicating basis ambiguity rather than pure noise
  • Stable features dominate reconstruction and prediction tasks while unstable features have minimal marginal functional impact
  • Seed dependence reflects the inherent non-uniqueness of sparse decompositions within shared regions of activation space
  • Pooling unique cross-seed features enables construction of more stable SAEs without sacrificing explained variance
  • Functional impact metrics provide more reliable feature evaluation than individual feature stability across training runs
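Marginal functional impact can be approximated by zeroing one latent at a time and recording how much reconstruction error grows. The sketch below is an assumed, simplified metric rather than the one used in the paper: `ablation_impact`, the toy codes, and the identity-like decoder are hypothetical, and prediction-level effects (such as changes in downstream loss) are not modeled.

```python
import numpy as np

def ablation_impact(x, codes, decoder):
    """Increase in mean squared reconstruction error when each latent
    is zeroed out in turn (larger values mean more functional impact)."""
    base = np.mean((x - codes @ decoder) ** 2)
    impacts = np.empty(codes.shape[1])
    for j in range(codes.shape[1]):
        ablated = codes.copy()
        ablated[:, j] = 0.0
        impacts[j] = np.mean((x - ablated @ decoder) ** 2) - base
    return impacts

# Toy setup: three latents with orthonormal decoder directions in 4-D.
decoder = np.eye(3, 4)
# Latent 0 fires strongly on every sample; latents 1 and 2 fire weakly.
codes = np.array([[2.0, 0.1, 0.0],
                  [2.0, 0.0, 0.1]])
x = codes @ decoder  # activations that the toy dictionary reconstructs exactly

impacts = ablation_impact(x, codes, decoder)
print(impacts)
```

In this toy setup, removing the strongly active latent raises the error far more than removing either weak one. Ranking latents by this kind of marginal effect is one way to read the claim that functional impact can be a better guide than cross-seed stability alone.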