AI · Neutral · arXiv – CS AI · 7h ago · 6/10
Cross-Layer Discrete Concept Discovery for Interpreting Language Models
Researchers introduce CLVQ-VAE, a framework for interpreting language models by discovering discrete, interpretable concepts across layers. The method collapses duplicated features in the residual stream into compact concept vectors and is reported to outperform existing approaches. In the reported results, removing the discovered concepts causes a 93% drop in accuracy, and humans recover the model's prediction from the concept visualizations 78% of the time.
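For readers unfamiliar with the "VQ" in the name, the general idea of vector quantization can be illustrated with a toy sketch. This is not the paper's implementation: the codebook, the activations, and the function names `nearest_code` and `quantize` are hypothetical, and the sketch covers only a plain nearest-neighbor lookup, not how CLVQ-VAE trains its codebook or maps between layers. It shows how several near-duplicate vectors can collapse onto one shared discrete code.

```python
import math


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared Euclidean distance)."""
    best_index, best_dist = 0, math.inf
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def quantize(activations, codebook):
    """Map each activation vector to the index of its nearest codebook entry."""
    return [nearest_code(a, codebook) for a in activations]


# Hypothetical toy data: three "concept" vectors and four activations,
# two of which are near-duplicates of the same concept.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
activations = [[0.9, 1.1], [1.05, 0.95], [-0.8, 1.2], [0.1, -0.1]]

concept_ids = quantize(activations, codebook)
print(concept_ids)  # → [1, 1, 2, 0]
```

The first two activations differ slightly but receive the same code, which is the sense in which quantization can merge redundant features into a single concept.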