Cross-Layer Discrete Concept Discovery for Interpreting Language Models
Researchers introduce CLVQ-VAE, a framework for interpreting language models by discovering discrete, interpretable concepts across layers. It collapses features duplicated in the residual stream into compact concept vectors and outperforms existing approaches: removing the identified concepts cuts model accuracy by up to 93%, and human annotators recover model predictions from its visualizations with 78% accuracy.
Language model interpretability has become a critical research frontier as these systems grow more complex and are integrated into high-stakes applications. Understanding how information flows through a network is hard, particularly across layers, where features are duplicated and mixed, and this has limited researchers' ability to identify what models actually learn. CLVQ-VAE addresses the problem with a discrete vector-quantization bottleneck that maps continuous, distributed representations to a finite set of interpretable concept vectors. Prior approaches operated in continuous space, where concepts remain diffuse and difficult to isolate.
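To make the idea of a vector-quantization bottleneck concrete, here is a minimal sketch in plain Python. It shows only the generic nearest-code lookup used in vector quantization, not the authors' cross-layer implementation; the function name `quantize` and the toy 2-D codebook are hypothetical and chosen purely for illustration.

```python
import math

def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector`.

    Generic vector-quantization lookup under Euclidean distance;
    an illustrative assumption, not the paper's exact procedure.
    """
    distances = [math.dist(vector, code) for code in codebook]
    index = min(range(len(codebook)), key=distances.__getitem__)
    return index, codebook[index]

# Hypothetical codebook of three 2-D "concept" vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]

# A continuous activation is replaced by its nearest discrete code.
index, code = quantize([0.9, 1.2], codebook)
```

Every continuous input ends up represented by one of a small number of codes, which is what makes the resulting concepts easier to enumerate and inspect than points in a continuous space.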
This work builds on established interpretability research, including sparse autoencoders and vector quantization techniques, but combines them in a novel cross-layer architecture. Exponential moving average (EMA) codebook updates and top-k, temperature-based sampling let the model explore the discrete latent space while preserving codebook diversity, balancing exploration against stability. Empirical validation on the ERASER-Movie, Jigsaw, and AGNews datasets demonstrates applicability to both encoder and decoder architectures.
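The two mechanisms named above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: `sample_code` and `ema_update` are hypothetical names, distances are Euclidean, the sampling weights are a softmax over negative distances, and the EMA step moves each code directly toward the mean of its assigned vectors (common VQ-VAE implementations instead keep running averages of cluster sizes and sums).

```python
import math
import random

def sample_code(vector, codebook, k=2, temperature=1.0, rng=random):
    """Sample a code index among the k nearest codes.

    Weights are softmax(-distance / temperature): a low temperature
    approaches nearest-code assignment, a high one explores more.
    """
    distances = [math.dist(vector, code) for code in codebook]
    nearest = sorted(range(len(codebook)), key=distances.__getitem__)[:k]
    # Shift by the smallest distance so the largest weight is exactly 1.
    d_min = distances[nearest[0]]
    weights = [math.exp(-(distances[i] - d_min) / temperature) for i in nearest]
    return rng.choices(nearest, weights=weights, k=1)[0]

def ema_update(code, assigned, decay=0.99):
    """Move `code` toward the mean of the vectors assigned to it."""
    if not assigned:
        return code  # unused codes are left unchanged in this sketch
    mean = [sum(column) / len(assigned) for column in zip(*assigned)]
    return [decay * c + (1 - decay) * m for c, m in zip(code, mean)]

# Hypothetical toy data: three codes and a batch of three activations.
rng = random.Random(0)
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
batch = [[0.9, 1.2], [0.1, -0.2], [1.1, 0.8]]

# Assign each activation to a sampled code, then update the codebook.
assignments = {i: [] for i in range(len(codebook))}
for vector in batch:
    chosen = sample_code(vector, codebook, k=2, temperature=0.5, rng=rng)
    assignments[chosen].append(vector)
codebook = [ema_update(code, assignments[i]) for i, code in enumerate(codebook)]
```

Sampling among the k nearest codes, rather than always taking the single nearest one, gives less-used codes a chance to receive assignments, while the slow EMA step keeps the codebook from shifting abruptly between batches.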
The practical implications for AI safety and trustworthiness are substantial. Removing the identified concepts reduces model accuracy by up to 93%, which is evidence that these features causally drive model behavior rather than merely correlating with it. Human annotators recovered model predictions with 78% accuracy from CLVQ-VAE visualizations, versus 54% for clustering baselines, which indicates that the discrete concepts capture human-interpretable patterns rather than statistical artifacts. For AI developers and safety researchers, the methodology enables more rigorous auditing, bias detection, and failure mode analysis. As language models are deployed in regulated industries and critical infrastructure, interpretability tools that scale across model architectures become increasingly valuable for compliance, debugging, and risk mitigation.
- CLVQ-VAE discovers discrete, interpretable concepts across language model layers by collapsing duplicated residual stream features through vector quantization.
- Removing identified concepts degrades model accuracy by up to 93%, providing causal evidence of feature importance.
- Human annotators achieve 78% accuracy recovering model predictions from CLVQ-VAE visualizations versus 54% for clustering baselines.
- The framework combines top-k, temperature-based sampling with exponential moving average (EMA) codebook updates to balance exploration of the discrete latent space with codebook diversity.
- Results show consistent improvements across encoder and decoder architectures and multiple datasets, suggesting broad applicability for AI interpretability research.