#neural-network-interpretability News & Analysis

2 articles tagged with #neural-network-interpretability. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

Researchers investigate feature stability in sparse autoencoders (SAEs) and find that features which are unstable across training runs concentrate in reproducible lower-rank subspaces rather than being pure noise. Stable features carry most of the functional signal for reconstruction and prediction. Unstable features have minimal individual impact, but they reflect shared geometric structure that different seeds resolve differently.
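To make the stable/unstable distinction concrete, here is a minimal sketch under stated assumptions; it is not the paper's procedure. It assumes each seed yields a decoder matrix with one feature direction per row, calls a feature "stable" when its best cosine match in the other seed clears a threshold, and compares the subspaces spanned by the leftover features. The function names, the 0.9 threshold, and the toy 6-dimensional dictionaries are all hypothetical.

```python
import numpy as np

def max_cosine_match(dict_a, dict_b):
    """Best cosine similarity of each row of dict_a to any row of dict_b."""
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)

def split_stable_unstable(dict_a, dict_b, threshold=0.9):
    """Boolean mask over dict_a rows: True where a close match exists in dict_b."""
    sims = max_cosine_match(dict_a, dict_b)
    return sims >= threshold, sims

def unstable_subspace_overlap(unstable_a, unstable_b, rank):
    """Mean squared cosine of the principal angles between the top-`rank`
    row subspaces of the two matrices (1.0 means identical subspaces)."""
    _, _, vt_a = np.linalg.svd(unstable_a, full_matrices=False)
    _, _, vt_b = np.linalg.svd(unstable_b, full_matrices=False)
    cosines = np.linalg.svd(vt_a[:rank] @ vt_b[:rank].T, compute_uv=False)
    return float(np.mean(cosines ** 2))

# Toy dictionaries (hypothetical): four features agree across seeds, while the
# last two span the same 2-D plane but are rotated by 45 degrees in seed B.
seed_a = np.eye(6)
seed_b = np.eye(6)
seed_b[4:, 4:] = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
seed_b = seed_b[[2, 0, 3, 1, 5, 4]]  # feature order is arbitrary across seeds

stable_a, sims_a = split_stable_unstable(seed_a, seed_b)
stable_b, _ = split_stable_unstable(seed_b, seed_a)
overlap = unstable_subspace_overlap(seed_a[~stable_a], seed_b[~stable_b], rank=2)
```

In this toy case the two rotated features fail the per-feature match in both seeds, yet the subspaces they span coincide, which is the kind of pattern the summary describes: the individual directions do not reproduce, but the subspace does.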

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Semantic Optimal Transport for Sparse Autoencoder Feature Matching and Circuit Compression

Researchers introduce a semantic distance metric for sparse autoencoder (SAE) features, built on distributional representations and the Wasserstein distance. The metric is reported to improve cross-layer feature matching and to enable automatic circuit compression in language model interpretability research.
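As a rough illustration of matching features by Wasserstein distance, the sketch below makes a strong simplifying assumption: each feature is summarized by an equal-size sample of scalar values. The summary does not say how the paper builds its distributional representations, so the representation, the feature names, and the nearest-neighbor matching rule here are all hypothetical. For equal-size one-dimensional samples, the Wasserstein-1 distance reduces to the mean absolute gap between sorted values.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def match_features(layer_a, layer_b):
    """Map each feature in layer_a to its nearest feature in layer_b.

    Both arguments are dicts from a feature name to its sample of values.
    Returns {name_a: (name_b, distance)}.
    """
    matches = {}
    for name_a, sample_a in layer_a.items():
        distances = {
            name_b: wasserstein_1d(sample_a, sample_b)
            for name_b, sample_b in layer_b.items()
        }
        best = min(distances, key=distances.get)
        matches[name_a] = (best, distances[best])
    return matches

# Hypothetical per-feature samples for two layers.
layer_a = {"a0": [0.0, 0.1, 0.2, 0.3], "a1": [2.0, 2.1, 2.2, 2.3]}
layer_b = {
    "b0": [2.05, 2.15, 2.25, 2.35],
    "b1": [0.3, 0.0, 0.2, 0.1],
    "b2": [5.0, 5.0, 5.0, 5.0],
}
matches = match_features(layer_a, layer_b)
```

Nearest-neighbor matching is used only for brevity; it can assign two features to the same target, so a one-to-one assignment would need a different matching step.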