Query Circuits: Explaining How Language Models Answer User Prompts
Researchers introduce query circuits, a method for tracing how a language model processes a specific input and generates its output by identifying sparse, faithful pathways inside the model itself. On MMLU questions, circuits covering only 1.3% of model connections recover roughly 60% of performance, and the approach offers more interpretable explanations than existing surrogate-based methods.
Query circuits represent a meaningful advance in AI interpretability by shifting focus from understanding broad model capabilities to explaining specific input-output mappings. This addresses a critical gap: existing circuit discovery methods reveal what models can do in general, but not why they make a particular decision on a particular input. The research reports that language models rely on surprisingly sparse internal pathways. Circuits using just 1.3% of connections recover ~60% of performance on MMLU questions, which suggests substantial redundancy in how these models compute individual answers.
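To make the two headline quantities concrete, the sketch below shows one way "fraction of connections kept" and "fraction of performance recovered" could be computed. It is an illustration under stated assumptions, not the paper's procedure: the function names, the edge counts, and the scores are all hypothetical, and the recovery formula is a generic baseline-normalized ratio often used in circuit analysis rather than the paper's Normalized Deviation Faithfulness, whose exact definition is not reproduced here.

```python
def edge_sparsity(kept_edges: int, total_edges: int) -> float:
    """Fraction of the model's connections retained in the circuit."""
    if total_edges <= 0:
        raise ValueError("total_edges must be positive")
    if not 0 <= kept_edges <= total_edges:
        raise ValueError("kept_edges must lie between 0 and total_edges")
    return kept_edges / total_edges


def normalized_recovery(circuit_score: float,
                        full_score: float,
                        baseline_score: float) -> float:
    """Share of the baseline-to-full-model gap that the circuit recovers.

    Assumption: baseline_score is the score with every connection ablated
    (for example, chance level on a multiple-choice task).
    """
    gap = full_score - baseline_score
    if gap == 0:
        raise ValueError("full model and baseline scores are identical")
    return (circuit_score - baseline_score) / gap


# Hypothetical numbers chosen only to mirror the scale quoted in the article.
sparsity = edge_sparsity(kept_edges=13_000, total_edges=1_000_000)
recovery = normalized_recovery(circuit_score=0.55,
                               full_score=0.75,
                               baseline_score=0.25)

print(f"connections kept: {sparsity:.1%}")
print(f"performance recovered: {recovery:.0%}")
```

Normalizing against a baseline matters on multiple-choice benchmarks, where even a fully ablated model scores at chance level, so a raw ratio of circuit score to full-model score would overstate how much the circuit contributes.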
The introduction of Normalized Deviation Faithfulness provides the field with a more rigorous evaluation metric for circuit discovery, extending beyond this single application. This methodological contribution strengthens the broader interpretability research agenda by enabling better validation of circuit identification across different architectures and tasks. The sampling-based discovery methods balance computational efficiency with accuracy, making the approach practical for real-world implementation rather than merely theoretical.
For the AI industry, improved interpretability directly addresses regulatory and safety concerns around black-box decision-making in language models. As AI systems integrate into critical applications, stakeholders increasingly demand transparent reasoning. Query circuits enable developers and auditors to verify model behavior at the input level, potentially revealing biases, hallucinations, or failure modes in specific contexts. This has implications for model debugging, safety alignment, and regulatory compliance.
Future work will likely focus on scaling these methods to larger models, applying them to multimodal systems, and integrating interpretability into model development workflows rather than treating it as post-hoc analysis. The project's public release creates opportunities for broader adoption and refinement by the research community.
- Query circuits identify sparse, faithful neural pathways within language models that explain specific input-output decisions.
- Normalized Deviation Faithfulness introduces a robust metric for evaluating circuit discovery accuracy across diverse benchmarks.
- Extremely sparse circuits (1.3% of connections) recover ~60% of model performance on MMLU questions, indicating significant redundancy in language models.
- The approach provides more direct, computationally efficient explanations than surrogate-based methods like sparse autoencoders.
- Query circuits advance AI safety and regulatory compliance by enabling input-level interpretability of model decisions.