AINeutralarXiv โ CS AI ยท 14h ago6/10
๐ง
A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima
Researchers develop the first unified theoretical framework for sparse dictionary learning (SDL) methods used in AI interpretability, proving these optimization problems are piecewise biconvex and characterizing why they produce flawed features. The work explains long-standing practical failures in sparse autoencoders and proposes feature anchoring as a solution to improve feature disentanglement in neural networks.