ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions
Researchers introduce Residualized Sparse Autoencoders (ReSAEs), a technique for analyzing and modifying transformer models that accounts for information flow across multiple layers. By training autoencoders on residual activations (what is left after removing the part that an affine map between layers already predicts) rather than on raw activations, ReSAEs reduce redundancy and better preserve model functionality during multi-layer interventions.
Sparse autoencoders have become essential tools for mechanistic interpretability research, allowing scientists to decompose transformer activations into interpretable features. However, traditional single-layer training approaches ignore a fundamental challenge: transformer layers are deeply interdependent, with each layer building upon previous computations. ReSAEs address this architectural reality by fitting affine maps between layers and training each subsequent autoencoder on only the unexplained residual information, effectively removing linearly predictable cross-layer structure.
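The residualization step described above can be sketched on toy data. This is a minimal illustration, not the authors' implementation: it assumes the affine map is fitted by ordinary least squares between two adjacent layers, and the names `fit_affine_map`, `residualize`, `x_prev`, and `x_next` are hypothetical. The source does not specify how the maps are fitted or which layer pairs are used.

```python
import numpy as np


def fit_affine_map(x_prev, x_next):
    """Least-squares fit of x_next ~= x_prev @ W + b (an assumed fitting method)."""
    n = x_prev.shape[0]
    # Append a column of ones so the bias is fitted together with the weights.
    x_aug = np.hstack([x_prev, np.ones((n, 1))])
    coef, *_ = np.linalg.lstsq(x_aug, x_next, rcond=None)
    return coef[:-1], coef[-1]  # W has shape (d, d), b has shape (d,)


def residualize(x_prev, x_next, W, b):
    """Return the part of x_next that the affine map from x_prev does not predict."""
    return x_next - (x_prev @ W + b)


# Toy stand-ins for activations at two layers (synthetic, not real model data).
rng = np.random.default_rng(0)
n_tokens, d_model = 2048, 16
x_prev = rng.normal(size=(n_tokens, d_model))
mixing = rng.normal(size=(d_model, d_model))
novel = 0.5 * rng.normal(size=(n_tokens, d_model))  # information new to the later layer
x_next = x_prev @ mixing + 1.0 + novel

W, b = fit_affine_map(x_prev, x_next)
residual = residualize(x_prev, x_next, W, b)

# The residual carries far less variance than the raw activations, because the
# linearly predictable cross-layer structure has been removed.
raw_var = x_next.var(axis=0).sum()
res_var = residual.var(axis=0).sum()
print(f"raw variance: {raw_var:.2f}, residual variance: {res_var:.2f}")
```

In this sketch, a sparse autoencoder for the later layer would be trained on `residual` instead of `x_next`, so its dictionary only needs to represent what the earlier layer does not already explain linearly.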
This matters because earlier multi-layer interventions could produce unpredictable interactions when several layers were modified at once. When separate layerwise dictionaries each represent the same information flowing through the residual stream, an intervention at one layer can interact counterintuitively with a change at another. The residualization approach addresses this by having each layer's dictionary focus only on novel information that the cross-layer affine maps do not already predict, which yields cleaner compositional behavior.
For AI safety and interpretability research, ReSAEs provide more reliable tools for understanding and controlling transformer behavior across depth. Experiments on Pythia-1.4B and Gemma-2-9B demonstrate that despite reconstructing less raw variance, ReSAEs better preserve the activation components critical to downstream computation. This suggests mechanistic understanding requires looking beyond surface-level activation statistics to identify functionally relevant features.
Looking ahead, these findings are likely to influence how interpretability researchers conduct multi-layer probing experiments and perform model interventions. The work positions residualization as a default consideration for SAE-based analysis, and it could become standard practice for larger, more complex models where cross-layer structure becomes increasingly pronounced.
- ReSAEs reduce decoder redundancy by training on residual activations rather than full layer outputs, so the same information is not represented again at each layer.
- Multi-layer interventions using ReSAEs show improved transformer cross-entropy recovery despite reconstructing less raw activation variance.
- The technique removes linearly predictable structure between layers, creating more predictable compositional behavior for simultaneous multi-layer modifications.
- Residualization preserves the components most relevant to model computation while filtering out linearly predictable structure carried over between layers.
- Results demonstrate effectiveness on both Pythia-1.4B and Gemma-2-9B models, suggesting broad applicability across transformer architectures.
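The takeaways above contrast two quantities: how much raw activation variance a reconstruction captures, and how much of the model's cross-entropy it recovers. The sketch below shows one common convention for each. The source does not give its exact definitions, so the function names, the normalization by an ablated baseline, and the example numbers are all assumptions made for illustration, not reported results.

```python
import numpy as np


def fraction_variance_explained(x, x_hat):
    """1 - (squared reconstruction error / total variance around the per-dimension mean)."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total


def ce_loss_recovered(ce_clean, ce_patched, ce_ablated):
    """Normalized cross-entropy recovery (an assumed convention).

    1.0 means the patched model matches the clean model's loss;
    0.0 means it is no better than the ablated baseline.
    """
    return (ce_ablated - ce_patched) / (ce_ablated - ce_clean)


# Made-up numbers purely to exercise the functions.
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 8))
mean_only = np.broadcast_to(x.mean(axis=0), x.shape)

fve_perfect = fraction_variance_explained(x, x)         # perfect reconstruction
fve_mean = fraction_variance_explained(x, mean_only)    # predicts only the mean
recovered = ce_loss_recovered(ce_clean=2.0, ce_patched=2.5, ce_ablated=10.0)
print(fve_perfect, fve_mean, recovered)
```

Under these definitions the two scores can diverge: a reconstruction may explain less variance yet recover more cross-entropy if the variance it misses matters little to downstream computation, which is the pattern the article attributes to ReSAEs.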