Position: Mechanistic Interpretability Must Disclose Identification Assumptions for Causal Claims
A research paper argues that mechanistic interpretability studies increasingly make causal claims without stating their identification assumptions, creating a credibility gap in AI research. The authors audit 10 papers across multiple methodologies, find that none contains a dedicated identification-assumptions section, and propose a new disclosure norm requiring researchers to state their causal claims, identification strategies, and the assumptions underpinning their conclusions.
Mechanistic interpretability research has become central to understanding neural network behavior, yet the field faces a methodological blind spot: researchers deploy causal language such as circuits, mediators, and causal abstraction without articulating the statistical assumptions that justify causal inference. The paper identifies a systemic problem in which validation metrics such as faithfulness, completeness, and ablation effects serve as proxies for causal support, conflating validation (does the proposed explanation reproduce the behavior?) with identification (does the mechanism actually cause the outcome?). The distinction matters because a metric can be high without identifying a causal relationship: spurious correlations, confounding, or measurement error can all inflate confidence in false claims.
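To make the distinction concrete, the sketch below is a synthetic illustration, not an analysis from the paper; the data-generating process and all variable names are assumptions made for the example. It shows how an ablation-style metric can look strong for a feature that has no causal role in the outcome, because a hidden confounder drives both the feature and the label.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process: a hidden confounder z drives both the
# label y and a "spurious" feature; a separate "causal" feature also drives y.
z = rng.normal(size=n)                      # unobserved confounder
x_causal = rng.normal(size=n)
y = (x_causal + z + 0.5 * rng.normal(size=n) > 0).astype(float)
x_spurious = z + 0.5 * rng.normal(size=n)   # correlated with y only through z

X = np.column_stack([x_causal, x_spurious])

# Fit a small logistic regression by gradient descent (a stand-in for the model
# whose internals an interpretability study would probe).
w, b = np.zeros(2), 0.0
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * (X.T @ grad) / n
    b -= 0.1 * grad.mean()

def accuracy(X_eval):
    p = 1.0 / (1.0 + np.exp(-(X_eval @ w + b)))
    return float(((p > 0.5) == y).mean())

baseline = accuracy(X)

# "Ablation": zero out the spurious feature, analogous to zeroing an internal
# activation, and read the accuracy drop as an ablation-effect metric.
X_ablated = X.copy()
X_ablated[:, 1] = 0.0
ablated = accuracy(X_ablated)

print(f"accuracy {baseline:.3f} -> {ablated:.3f} after ablating the spurious feature")
# The drop is a sizeable "ablation effect", yet x_spurious has no causal
# influence on y in the generative process; the effect reflects confounding by z.
```

Under the disclosure norm discussed below, a study reporting such an ablation effect would have to state the no-confounding assumption that licenses reading the metric causally.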
The audit methodology is rigorous: a purposive review of 10 papers across four methodological strands finds a consistent absence of explicit identification-assumptions sections, and a secondary human-coded audit of 30 samples confirms the pattern. This gap reflects broader challenges in AI research, where empirical validation often substitutes for causal rigor. As mechanistic interpretability informs AI safety decisions and regulatory frameworks, unstated assumptions pose epistemic and practical risks: a model circuit identified through ablation might capture correlation rather than causation, leading researchers and policymakers to misunderstand how AI systems actually function.
The proposed disclosure norm requires explicit causal claims, named identification strategies, enumerated assumptions, stress-testing of at least one assumption, and counterfactual reasoning about the robustness of conclusions; it raises research standards without imposing impossible demands. Implementation carries modest overhead but yields substantial gains in clarity. For the AI safety and interpretability communities, this represents a critical juncture where methodological rigor can prevent downstream errors in alignment research and model auditing.
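As a concrete illustration of what the norm could look like in practice, here is a hypothetical disclosure block; the field names and example content are assumptions made for this sketch, not a template prescribed by the paper.

```python
# Hypothetical disclosure statement for an interpretability result; every field
# name and claim below is illustrative, not drawn from the audited papers.
disclosure = {
    "causal_claim": "Circuit C mediates behavior B on prompt distribution D.",
    "identification_strategy": "Interventional ablation with randomized prompts.",
    "assumptions": [
        "No unobserved component jointly drives circuit C and behavior B.",
        "Ablating C does not push other components off-distribution.",
        "The behavior metric measures B without systematic error.",
    ],
    "stress_test": "Repeat the ablations under a shifted prompt distribution "
                   "to probe the no-confounding assumption.",
    "robustness_counterfactual": "If the off-distribution assumption fails, the "
                                 "estimate bounds, rather than identifies, the effect.",
}

# A reviewer or auditor can then check each conclusion against the assumption
# it depends on, rather than inferring the assumptions from the metrics alone.
for assumption in disclosure["assumptions"]:
    print("-", assumption)
```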
- Mechanistic interpretability papers frequently claim causality without disclosing the statistical assumptions required to justify such claims.
- Validation metrics and causal identification are distinct concepts; high metric scores do not prove causal mechanisms.
- An audit of 10 papers found zero dedicated identification-assumptions sections, indicating systemic methodological gaps.
- The proposed disclosure norm requires explicit causal language, identification strategies, assumption enumeration, and robustness testing.
- Unstated assumptions in interpretability research could lead AI safety researchers to misunderstand model behavior and misdirect safety interventions.