🧠 AI | ⚪ Neutral | Importance 6/10

Interactions Between Crosscoder Features: A Compact Proofs Perspective

arXiv – CS AI | Dmitry Manning-Coe, Thomas Read, Anna Soligo, Oliver Clive-Griffin, Chun-Hei Yip, Rajashree Agrawal, Jason Gross | June 10, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce a framework that uses compact proofs to measure feature interactions in crosscoders and Sparse Autoencoders, showing that interactions between learned features cause reconstruction errors. The work demonstrates practical applications, including computationally sparse models that retain 60% of MLP performance with a single feature selected per datapoint, and detection of sleeper agent behavior in AI systems.

Analysis

This research addresses a fundamental challenge in AI interpretability: understanding how a network's features interact once it has been decomposed with dictionary learning methods such as crosscoders. The compact proofs framework gives a formal mathematical basis for quantifying these interactions instead of treating features as purely independent components. This matters because interpretability tools that assume feature independence incur systematic errors whenever features do interact.

The authors demonstrate that their interaction measure can be added directly as a loss penalty during training, which lets models learn more computationally efficient representations. With a single feature selected per datapoint, the resulting models reach 60% of MLP performance, versus only 10% for standard approaches, which suggests that interaction-aware optimization changes how computation is distributed across features. The semantic clustering results indicate that the interaction measure captures meaningful structure in model behavior.

The sleeper agent application points to practical security implications: these deceptive models exhibit pronounced feature interactions, which could provide a detection signal for problematic AI behavior. The work thus connects theoretical interpretability research with practical applications in model compression and AI safety. For the broader AI community, the findings suggest that interpretability methods must account for feature interactions to explain model behavior accurately. The open-sourced code should make adoption easier for other research groups.

Key Takeaways

→Compact proofs framework quantifies feature interactions in crosscoders, addressing hidden reconstruction errors from assumed independence.
→Interaction-aware training reaches 60% of MLP performance with a single feature selected per datapoint, versus 10% for standard approaches.
→Interaction measures enable semantic feature clustering and reveal significant interactions in sleeper agent AI systems.
→The differentiable loss penalty approach provides a practical optimization method for building interpretable, computationally efficient models.
→Results have implications for AI safety monitoring and detection of deceptive model behavior.
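
To make the idea of an interaction term used as a training penalty concrete, here is a minimal sketch in plain Python. It does not reproduce the paper's compact-proofs measure, which this summary does not specify. As a loudly labeled stand-in, it penalizes pairs of features that are active on the same datapoint and whose decoder directions overlap. The function names, the penalty form, and the weight `lam` are all hypothetical choices made for illustration.

```python
# Illustrative sketch only. The penalty below is a hypothetical stand-in,
# NOT the interaction measure defined in the paper.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def reconstruct(acts, decoder):
    """Reconstruction: sum of decoder rows weighted by feature activations."""
    dim = len(decoder[0])
    return [sum(a * row[k] for a, row in zip(acts, decoder)) for k in range(dim)]

def mse(x, x_hat):
    """Mean squared reconstruction error."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def interaction_penalty(acts, decoder):
    """Assumed penalty: sum over pairs i < j of |a_i * a_j| * |<d_i, d_j>|.

    It is zero when at most one feature is active, or when every pair of
    co-active features has orthogonal decoder directions.
    """
    total = 0.0
    n = len(acts)
    for i in range(n):
        for j in range(i + 1, n):
            total += abs(acts[i] * acts[j]) * abs(dot(decoder[i], decoder[j]))
    return total

def total_loss(x, acts, decoder, lam=0.1):
    """Reconstruction error plus the weighted stand-in interaction penalty."""
    return mse(x, reconstruct(acts, decoder)) + lam * interaction_penalty(acts, decoder)

# Toy example: three features in a 2-dimensional activation space.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [3.0, 2.0]
acts = [1.0, 0.0, 2.0]  # features 0 and 2 are co-active and overlap

penalty = interaction_penalty(acts, decoder)
loss = total_loss(x, acts, decoder)
```

In this toy case the reconstruction is exact, so the loss comes entirely from the penalty on the overlapping pair of active features. In an actual training loop the same two terms would be written in an autodiff framework so that gradients flow through both.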