Do Sparse Autoencoders Capture Concept Manifolds?
Researchers demonstrate that semantic concepts are encoded along low-dimensional manifolds rather than as isolated linear directions, and that existing sparse autoencoder (SAE) architectures recover these continuous structures only suboptimally, in a fragmented regime the authors call dilution. The findings suggest future interpretability methods should treat geometric objects, rather than individual feature directions, as their fundamental units.
This research addresses a fundamental gap between how neural network interpretability tools are designed and how concepts actually organize themselves in learned representations. Sparse autoencoders have become central to mechanistic interpretability efforts, enabling researchers to decompose complex neural activations into human-readable features. However, this work reveals that these tools operate under a flawed assumption: that concepts map to independent linear directions. The authors demonstrate that concepts instead lie along geometric manifolds, continuous and structured spaces in which relationships between concepts carry mathematical meaning.
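To make the decomposition step concrete, here is a minimal sketch of a sparse autoencoder in PyTorch. It is an illustration, not the paper's architecture; the class name, dimensions, and sparsity coefficient are hypothetical choices.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: reconstructs activations as a sparse combination of learned directions."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature activations
        x_hat = self.decoder(f)           # each active feature adds one decoder direction
        return x_hat, f

# Hypothetical sizes; real SAEs use the model's activation width and a much
# larger (overcomplete) feature dictionary.
sae = SparseAutoencoder(d_model=512, d_hidden=4096)
x = torch.randn(8, 512)                   # a batch of model activations
x_hat, f = sae(x)
# Training minimizes reconstruction error plus an L1 penalty that encourages sparsity.
loss = ((x - x_hat) ** 2).mean() + 1e-3 * f.abs().mean()
```

Each column of the decoder weight matrix is one learned feature direction, which is exactly where the independent-linear-direction assumption enters.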
The theoretical framework distinguishes two potential strategies for capturing manifolds: global allocation, where neurons collectively span the entire manifold structure, and local tiling, where individual neurons selectively activate in restricted regions. Empirically, SAEs employ both strategies inefficiently, creating what researchers term 'dilution'—a fragmented regime that obscures manifold structure at the individual feature level. This explains why current interpretability methods rarely reveal continuous concept relationships, fundamentally limiting our understanding of how neural networks encode semantic information.
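The distinction can be illustrated on a toy one-dimensional manifold. The sketch below (our construction, not the paper's experiments) encodes points on a circle two ways: globally, with a pair of features that span the whole circle, and locally, with many features that each cover a narrow arc. The `active_fraction` helper and all constants are hypothetical.

```python
import numpy as np

# Synthetic 1-D concept manifold: points on a circle, parameterized by angle.
theta = np.linspace(0, 2 * np.pi, 500, endpoint=False)

# Global allocation (stylized): two features jointly span the whole circle,
# like shifted cosine/sine coordinates.
global_feats = np.stack([np.cos(theta) + 1, np.sin(theta) + 1], axis=1)

# Local tiling (stylized): sixteen features, each active only on a narrow arc.
centers = np.linspace(0, 2 * np.pi, 16, endpoint=False)
local_feats = np.maximum(0.0, np.cos(theta[:, None] - centers[None, :]) - 0.9)

def active_fraction(feats, eps=1e-6):
    """Fraction of manifold points on which each feature fires."""
    return (feats > eps).mean(axis=0)

print(active_fraction(global_feats))  # close to 1.0 for both features
print(active_fraction(local_feats))   # roughly 0.14 for each feature
# Dilution corresponds to a mix of the two regimes: features that are neither
# cleanly global nor cleanly local, so no single feature exposes the circle's geometry.
```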
For the interpretability and AI safety communities, these findings carry significant implications. Current approaches to understanding neural networks may miss critical organizational principles in learned representations. The research suggests moving beyond neuron-level analysis toward discovering coherent groupings of features that jointly encode geometric concepts. This shift could enable more robust interpretability methods and potentially improve our ability to audit model behavior. The work motivates development of post-hoc discovery algorithms that identify feature clusters rather than isolated directions, redirecting interpretability research toward group-level analysis.
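One way such a post-hoc discovery step could look is sketched below: cluster SAE decoder directions by cosine distance and treat each cluster as a candidate group of features that may jointly tile one manifold. This is an assumption-laden illustration, not the paper's algorithm; the decoder matrix is random placeholder data, the distance threshold is a made-up value, and it relies on scikit-learn's AgglomerativeClustering with a precomputed metric (available in recent versions).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Placeholder decoder dictionary: one unit-norm direction per SAE feature.
rng = np.random.default_rng(0)
decoder = rng.standard_normal((1024, 256))
decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)

# Pairwise cosine distances between feature directions.
cos_dist = np.clip(1.0 - decoder @ decoder.T, 0.0, None)

# Merge features whose directions stay within a cosine-distance threshold;
# each resulting cluster is a candidate group to inspect jointly.
clustering = AgglomerativeClustering(
    n_clusters=None,
    metric="precomputed",
    linkage="average",
    distance_threshold=0.5,   # hypothetical threshold; would need tuning on real SAEs
)
labels = clustering.fit_predict(cos_dist)
print("candidate feature groups:", labels.max() + 1)
# On random placeholder directions nearly every feature stays its own group;
# on a real SAE decoder, correlated directions that tile one manifold would merge.
```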
- SAEs capture manifold structures suboptimally by mixing global and local solutions, creating dilution that fragments manifold geometry across features
- Concepts organize along low-dimensional manifolds encoding continuous relationships, not independent linear directions as traditionally assumed
- Future interpretability methods should treat geometric objects and feature groupings as fundamental units rather than analyzing individual neurons in isolation
- Post-hoc unsupervised discovery methods searching for coherent feature clusters could reveal manifold structure invisible in individual concept analysis
- This framework reconciles theoretical assumptions about neural representations with empirical evidence of geometric concept organization