AI · Neutral · arXiv – CS AI · 9h ago · 7/10
Activation Differences Reveal Backdoors: A Comparison of SAE Architectures
Researchers report that Differential SAEs (Diff-SAE) outperform Crosscoders at detecting backdoors in language models, reaching a Backdoor Isolation Score of 0.40 with perfect precision. The results suggest that backdoors manifest as directional shifts in activations rather than as sparse features, a finding relevant to AI safety monitoring and to the design of interpretability tools.
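To make the "directional shift" idea concrete, here is a toy sketch in plain Python. It is not the paper's Diff-SAE or its Backdoor Isolation Score. It only assumes that a backdoor adds a fixed dense vector to the activations of triggered inputs, and shows that differencing the two models' activations can expose that direction. Every name and parameter here (`simulate`, `shift`, `trigger_rate`, the 1.5 threshold) is hypothetical and chosen for illustration.

```python
import math
import random


def simulate(n=200, dim=16, trigger_rate=0.25, shift=3.0, noise=0.1, seed=0):
    """Return activation differences (tuned - base), trigger labels, true direction.

    Synthetic data only: the "backdoor" is assumed to add shift * direction
    to the tuned model's activations on triggered inputs.
    """
    rng = random.Random(seed)
    # Hypothetical backdoor direction: a dense unit vector, not one sparse coordinate.
    raw = [rng.gauss(0, 1) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in raw))
    direction = [x / norm for x in raw]

    diffs, labels = [], []
    for _ in range(n):
        triggered = rng.random() < trigger_rate
        base = [rng.gauss(0, 1) for _ in range(dim)]
        tuned = [
            b + rng.gauss(0, noise) + (shift * d if triggered else 0.0)
            for b, d in zip(base, direction)
        ]
        diffs.append([t - b for t, b in zip(tuned, base)])
        labels.append(triggered)
    return diffs, labels, direction


def mean_direction(diffs):
    """Unit vector along the mean activation difference."""
    dim = len(diffs[0])
    mean = [sum(d[i] for d in diffs) / len(diffs) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


def project(diffs, direction):
    """Scalar projection of each difference vector onto the direction."""
    return [sum(a * b for a, b in zip(d, direction)) for d in diffs]


diffs, labels, true_dir = simulate()
est_dir = mean_direction(diffs)
scores = project(diffs, est_dir)

# Cosine similarity between the estimated and the planted direction.
cosine = sum(a * b for a, b in zip(est_dir, true_dir))
# Flag inputs whose difference projects strongly onto the estimated direction.
flagged = [s > 1.5 for s in scores]

print(f"cosine similarity to planted direction: {cosine:.3f}")
print(f"flagged {sum(flagged)} of {len(flagged)} inputs")
```

Because the input-specific part of the activations cancels in the difference, the planted shift dominates the mean difference even though it is spread across all 16 dimensions, so no single coordinate marks it. Real model activations are far noisier than this simulation, so the separation here should not be read as an estimate of real-world detection performance.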