#circuit-discovery News & Analysis

9 articles tagged with #circuit-discovery. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 1 · 7/10

Mechanistic Interpretability as Statistical Estimation: A Variance Analysis

Researchers demonstrate that mechanistic interpretability (MI), the practice of reverse-engineering AI model behaviors through circuit discovery, suffers from fundamental statistical instability due to high variance in causal mediation analysis. The findings show that circuit structures are fragile and highly sensitive to changes in input data and hyperparameters, which calls into question the scientific validity of existing MI methodologies and argues for stricter statistical practices in the field.

AI · Neutral · arXiv – CS AI · May 12 · 7/10

Data-driven Circuit Discovery for Interpretability of Language Models

Researchers introduce Data-driven Circuit Discovery (DCD), a new framework for understanding language models that challenges the assumption that models implement tasks using a single computational circuit. By clustering data based on how models process examples, DCD discovers multiple task-specific circuits per dataset, revealing that existing methods conflate distinct mechanisms into single circuits and produce dataset-dependent rather than generalizable interpretations.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 5

Certified Circuits: Stability Guarantees for Mechanistic Circuits

Researchers introduce Certified Circuits, a framework that provides provable stability guarantees for neural network circuit discovery. The method wraps existing algorithms with randomized data subsampling to ensure circuit components remain consistent across dataset variations, achieving 91% higher accuracy while using 45% fewer neurons.
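The subsampling idea in this summary can be illustrated with a generic stability-selection wrapper. This is only a minimal sketch under stated assumptions, not the Certified Circuits procedure itself: `stable_components`, `toy_discover`, the 0.9 threshold, and the toy data are all hypothetical names and values chosen for illustration.

```python
import random
from collections import Counter

def stable_components(discover, dataset, n_rounds=50, frac=0.5,
                      threshold=0.9, seed=0):
    """Keep components that `discover` returns in at least `threshold`
    of the runs on random subsamples of `dataset` (hypothetical wrapper)."""
    rng = random.Random(seed)
    k = max(1, int(len(dataset) * frac))
    counts = Counter()
    for _ in range(n_rounds):
        subsample = rng.sample(dataset, k)       # draw without replacement
        counts.update(set(discover(subsample)))  # count each component once per run
    return {c for c, n in counts.items() if n / n_rounds >= threshold}

# Hypothetical toy discovery routine: component "h0" is always selected,
# while "h<x>" is selected only when example x happens to be in the subsample.
def toy_discover(examples):
    return {"h0"} | {f"h{x}" for x in examples if x % 7 == 0}

data = list(range(1, 41))
result = stable_components(toy_discover, data)
print(result)
```

In this toy setup the data-dependent components show up in roughly half of the runs, so only the component that is selected regardless of the subsample survives the threshold.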

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

Researchers have identified systematic errors in attribution patching, a widely used gradient-based method for interpreting language model behavior, and developed a Hessian-vector-product correction that eliminates leading-order errors with minimal computational overhead. The work provides practical tools, including reliability scores and error bounds, that enable more accurate circuit identification in mechanistic interpretability research across model scales from 124M to 9B parameters.
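To see why a first-order estimate can mislead and how a Hessian-vector product can correct it, consider a generic second-order Taylor expansion on a toy quadratic metric. This is a sketch under stated assumptions rather than the paper's implementation: the metric, the matrix `Q`, the vector `b`, and the activation values are all hypothetical, and the Hessian-vector product is computed analytically because the toy metric is quadratic.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

# Hypothetical toy metric m(a) = b.a + 0.5 * a^T Q a over a 2-d "activation".
Q = [[2.0, 0.5], [0.5, 1.0]]
b = [1.0, -1.0]

def metric(a):
    return dot(b, a) + 0.5 * dot(a, matvec(Q, a))

def grad(a):
    # Gradient of the toy metric: b + Q a (Q is symmetric).
    return [bi + qi for bi, qi in zip(b, matvec(Q, a))]

def hvp(a, v):
    # Hessian-vector product; the Hessian of a quadratic is the constant Q.
    return matvec(Q, v)

a_clean = [0.2, -0.4]
a_patch = [1.0, 0.5]
delta = [p - c for p, c in zip(a_patch, a_clean)]

true_effect = metric(a_patch) - metric(a_clean)            # actual patching effect
first_order = dot(grad(a_clean), delta)                    # gradient-only estimate
second_order = first_order + 0.5 * dot(delta, hvp(a_clean, delta))

print(true_effect, first_order, second_order)
```

Because the toy metric is exactly quadratic, the second-order estimate matches the true effect up to floating-point error, while the first-order estimate misses the curvature term. Real model metrics are not quadratic, so the correction there would only remove the leading-order part of the error.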

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

How Many Counterfactuals Does It Take? Probing VLM Hallucinations Through Circuits and Causal Effects

Researchers present a novel methodology for detecting hallucinations in Visual Language Models by measuring sample complexity under counterfactual perturbations. Using circuit discovery techniques and causal influence metrics, they establish empirical bounds on the minimum counterfactual samples needed to reliably identify unstable hallucinated predictions.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

Researchers propose a methodology for validating attention-head circuits in large language models by combining co-activation clustering with causal ablation testing. Their findings show that clustering signals can propose candidate circuits, but validating a circuit requires closure tests that measure functional impact through ablation, a distinction that challenges current interpretability approaches.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Query Circuits: Explaining How Language Models Answer User Prompts

Researchers introduce query circuits, a method to trace how language models process specific inputs and generate outputs by identifying sparse, faithful neural pathways within the model itself. The approach achieves significant performance recovery using only 1.3% of model connections on benchmark tasks, offering more interpretable AI explanations than existing surrogate-based methods.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Beyond Transfer Accuracy: Faithful Circuits for Controlled Low-Resource Adaptation

Researchers introduce a counterfactual-free circuit discovery method adapted for unstructured natural text, enabling Circuit-Targeted Supervised Fine-Tuning (CT-SFT) that improves low-resource model adaptation while preserving performance on source tasks and preventing catastrophic forgetting.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 6

CIRCUS: Circuit Consensus under Uncertainty via Stability Ensembles

Researchers introduce CIRCUS, a new method for discovering mechanistic circuits in AI models that addresses uncertainty and brittleness issues in current approaches. The technique creates ensemble attribution graphs and extracts consensus circuits that are 40x smaller while maintaining explanatory power, validated on Gemma-2-2B and Llama-3.2-1B models.