Interactions Between Crosscoder Features: A Compact Proofs Perspective
Researchers introduce a framework that uses compact proofs to measure feature interactions in crosscoders and Sparse Autoencoders, showing that interactions between learned features are a source of reconstruction error. The work demonstrates practical applications, including computationally sparse models that retain 60% of MLP performance with a single feature kept per datapoint, and the detection of strong feature interactions in sleeper agent models.
This research addresses a core challenge in AI interpretability: understanding how neural network features interact when a model is decomposed through dictionary learning methods such as crosscoders. The compact proofs framework gives a formal way to quantify these interactions rather than treating features as independent components. This matters because many interpretability tools implicitly assume feature independence, which introduces systematic errors when features do interact.
The authors show that their interaction measure can be added directly to the training objective as a differentiable loss penalty, which pushes the model toward computationally sparser representations. The resulting model retains 60% of MLP performance when only a single feature is kept per datapoint, compared with 10% for standard approaches, which suggests that interaction-aware optimization substantially changes how computation is distributed across features. The semantic clustering results indicate that the interaction measure captures meaningful structure in model behavior.
The sleeper agent application points to security implications: these deceptive models exhibit pronounced feature interactions, which could serve as a detection signal for problematic behavior. The work thus connects theoretical interpretability with practical applications in model compression and AI safety. For the broader AI community, the findings suggest that interpretability methods must account for feature interactions to explain model behavior accurately. The open-sourced code should make it easier for other research groups to adopt the approach.
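To make the notion of feature interaction concrete, the following minimal sketch compares a ReLU MLP with a purely linear map. It uses a generic second-order difference, f(a+b) - f(a) - f(b) + f(0), as an illustrative stand-in; this is not the paper's compact-proofs measure, and all names, shapes, and weights here are hypothetical.

```python
import numpy as np

def relu_mlp(x, w_in, w_out):
    """One-hidden-layer ReLU MLP without biases, so relu_mlp(0) == 0."""
    return np.maximum(x @ w_in, 0.0) @ w_out

def pairwise_interaction(f, a, b):
    """Second-order difference f(a+b) - f(a) - f(b) + f(0).

    It is zero whenever f treats the two contributions additively,
    i.e. when the two features do not interact through f.
    """
    zero = np.zeros_like(a)
    return f(a + b) - f(a) - f(b) + f(zero)

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 32
w_in = rng.normal(size=(d_model, d_hidden))
w_out = rng.normal(size=(d_hidden, d_model))

# Two hypothetical feature contributions to the MLP input.
feat_a = rng.normal(size=d_model)
feat_b = rng.normal(size=d_model)

mlp = lambda x: relu_mlp(x, w_in, w_out)
linear = lambda x: x @ w_in @ w_out  # same weights, no nonlinearity

interaction_mlp = np.linalg.norm(pairwise_interaction(mlp, feat_a, feat_b))
interaction_linear = np.linalg.norm(pairwise_interaction(linear, feat_a, feat_b))

print(f"ReLU MLP interaction norm:   {interaction_mlp:.4f}")
print(f"linear map interaction norm: {interaction_linear:.2e}")
```

The linear map shows no interaction beyond floating-point error, while the ReLU MLP does: the nonlinearity is what makes the effect of two features together differ from the sum of their separate effects, which is the kind of error an independence assumption ignores.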
- Compact proofs framework quantifies feature interactions in crosscoders, addressing hidden reconstruction errors from assumed independence.
- Interaction-aware training retains 60% of MLP performance with a single feature kept per datapoint and neuron, versus 10% for standard sparse autoencoders.
- Interaction measures enable semantic feature clustering and reveal significant interactions in sleeper agent AI systems.
- The differentiable loss penalty approach provides a practical optimization method for building interpretable, computationally efficient models.
- Results have implications for AI safety monitoring and detection of deceptive model behavior.
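The single-feature evaluation mentioned in the takeaways can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' evaluation code: it keeps only the largest activation per datapoint (not per neuron), uses fraction of variance explained as a stand-in for "MLP performance", and runs on synthetic activations with a hypothetical random decoder.

```python
import numpy as np

def keep_top1(acts):
    """Zero all but the largest activation in each row (one feature per datapoint)."""
    rows = np.arange(acts.shape[0])
    top = np.argmax(acts, axis=1)
    sparse = np.zeros_like(acts)
    sparse[rows, top] = acts[rows, top]
    return sparse

def fraction_variance_explained(target, approx):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    resid = np.sum((target - approx) ** 2)
    total = np.sum((target - target.mean(axis=0)) ** 2)
    return 1.0 - resid / total

rng = np.random.default_rng(0)
n_points, n_features, d_out = 256, 16, 8

# Synthetic sparse, non-negative feature activations (hypothetical data).
mask = rng.random((n_points, n_features)) < 0.25
acts = rng.exponential(size=(n_points, n_features)) * mask

# Hypothetical decoder; the target is the full reconstruction from all features.
decoder = rng.normal(size=(n_features, d_out))
target = acts @ decoder

full_fve = fraction_variance_explained(target, acts @ decoder)
top1_fve = fraction_variance_explained(target, keep_top1(acts) @ decoder)

print(f"all features kept: {full_fve:.3f}")
print(f"top-1 per point:   {top1_fve:.3f}")
```

The gap between the two numbers is what the reported 60% versus 10% comparison measures in spirit: how much of the output survives when each datapoint is explained by one feature alone.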