AI · Neutral · arXiv – CS AI · 3h ago · 6/10
Singular Vectors of Attention Heads Align with Features
Researchers show that the singular vectors of attention-head matrices in language models reliably align with learned feature representations, giving a theoretical justification for using singular value decomposition to identify interpretable features. The work contributes to mechanistic interpretability research by explaining why this alignment occurs and by proposing testable predictions for detecting it in real models.
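To make the idea of "alignment" concrete, here is a minimal toy sketch in Python. It is not the paper's method or data. It assumes a synthetic stand-in for an attention matrix, built as one planted feature pair plus small Gaussian noise, and checks whether the top right singular vector points along the planted key-side feature. All names (`make_toy_head`, `top_right_singular_vector`, `alignment`) and all parameter values are hypothetical, and power iteration stands in for a full SVD routine so the example needs only the standard library.

```python
import math
import random


def normalize(x):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in matrix]


def make_toy_head(dim=16, strength=5.0, noise=0.05, seed=0):
    """Build a synthetic matrix: one planted feature pair plus noise.

    This is an assumed toy model, not weights from a real language model.
    """
    rng = random.Random(seed)
    feature_q = normalize([rng.gauss(0, 1) for _ in range(dim)])
    feature_k = normalize([rng.gauss(0, 1) for _ in range(dim)])
    weights = [
        [strength * feature_q[i] * feature_k[j] + noise * rng.gauss(0, 1)
         for j in range(dim)]
        for i in range(dim)
    ]
    return weights, feature_q, feature_k


def top_right_singular_vector(weights, iters=200, seed=1):
    """Approximate the top right singular vector by power iteration on W^T W."""
    rng = random.Random(seed)
    weights_t = [list(col) for col in zip(*weights)]
    v = normalize([rng.gauss(0, 1) for _ in range(len(weights[0]))])
    for _ in range(iters):
        v = normalize(matvec(weights_t, matvec(weights, v)))
    return v


def alignment(a, b):
    """Absolute cosine similarity; singular vectors are defined up to sign."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return abs(dot) / (norm_a * norm_b)


if __name__ == "__main__":
    W, f_q, f_k = make_toy_head()
    v1 = top_right_singular_vector(W)
    print("alignment with planted key-side feature:", alignment(v1, f_k))
```

In this toy setting the alignment score is close to 1 because the planted feature dominates the noise. Whether real attention heads behave this way is the question the paper addresses, and this sketch does not test it.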