When Attribution Patching Lies: Diagnosis and a Second-Order Correction
Researchers have identified systematic errors in attribution patching, a widely used gradient-based method for interpreting language model behavior, and developed a Hessian-vector-product correction that eliminates leading-order errors with minimal computational overhead. The work provides practical tools, including reliability scores and error bounds, that enable more accurate circuit identification in mechanistic interpretability research across model scales from 124M to 9B parameters.
Attribution patching has become a standard tool in mechanistic interpretability research because it offers a computationally efficient approximation to activation patching, the gold-standard causal metric for understanding which neural network components drive model behavior. However, practitioners have had limited understanding of when and why attribution patching fails, which creates a risk that circuit discoveries rest on flawed causal analysis. This research addresses that gap by diagnosing the source of attribution patching errors and providing actionable corrections.
The core contribution shows that attribution patching's unreliability stems primarily from non-linearities in downstream network layers rather than from local curvature at the patched components. This insight reframes how researchers should think about the approximation quality of gradient-based methods in neural networks. The proposed Hessian-vector-product correction eliminates the dominant error term while remaining computationally feasible: it requires only one additional backward pass, whereas alternatives such as Integrated Gradients need a separate forward and backward pass for every integration step.
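To make the idea concrete, here is a minimal sketch on a toy problem, not the authors' implementation. It assumes that attribution patching is the first-order Taylor estimate of the activation-patching effect and that the correction adds the second-order term, half of the patch direction dotted with a Hessian-vector product. The function `metric` (a linear readout followed by `tanh`), the weights `w`, and the activations are all hypothetical stand-ins for the network downstream of a patched component. In a real model the Hessian-vector product would come from automatic differentiation; here it is written in closed form so the example needs only the standard library.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def metric(a, w):
    # Hypothetical stand-in for everything downstream of the patched
    # activation a: a linear readout followed by a tanh non-linearity.
    return math.tanh(dot(w, a))

def grad_metric(a, w):
    # Gradient of tanh(w.a) with respect to a: (1 - t^2) * w.
    t = metric(a, w)
    return [(1.0 - t * t) * wi for wi in w]

def hvp_metric(a, w, v):
    # The Hessian of tanh(w.a) is -2 t (1 - t^2) w w^T, so the
    # Hessian-vector product is H v = -2 t (1 - t^2) (w.v) w.
    # The full Hessian is never materialised.
    t = metric(a, w)
    coeff = -2.0 * t * (1.0 - t * t) * dot(w, v)
    return [coeff * wi for wi in w]

def patching_estimates(a_clean, a_corrupt, w):
    delta = [c - a for c, a in zip(a_corrupt, a_clean)]
    # Activation patching: actually swap in the corrupted activation.
    exact = metric(a_corrupt, w) - metric(a_clean, w)
    # Attribution patching: first-order Taylor estimate at the clean point.
    first_order = dot(grad_metric(a_clean, w), delta)
    # Assumed form of the correction: add the second-order Taylor term.
    second_order = first_order + 0.5 * dot(delta, hvp_metric(a_clean, w, delta))
    return exact, first_order, second_order

# Hypothetical toy values chosen so the downstream tanh is clearly non-linear.
w = [0.5, -0.3, 0.8]
a_clean = [0.2, 0.1, 0.4]
a_corrupt = [0.6, -0.2, 0.9]

exact, first_order, second_order = patching_estimates(a_clean, a_corrupt, w)
print("exact effect:        ", exact)
print("first-order estimate:", first_order)
print("with HVP correction: ", second_order)
```

In this toy setting the patched activation enters the metric linearly, so the whole approximation error comes from the downstream `tanh`, and the corrected estimate lands closer to the exact effect than the first-order one. That is not guaranteed in general: for large patch directions the second-order expansion can also be inaccurate.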
For the interpretability community, this work establishes a methodology for auditing mechanistic interpretability findings. The reliability score for detecting untrustworthy estimates enables researchers to apply targeted computational effort through the proposed Screen-Flag-Fix workflow, improving both accuracy and efficiency. Across evaluations on multiple model families, the multi-step HVP variant matches or exceeds Integrated Gradients accuracy at substantially lower computational cost. This efficiency gain becomes increasingly important as language models scale, where standard methods become prohibitively expensive. The research strengthens the empirical foundation of circuit discovery work, a critical prerequisite for trustworthy interpretability research that informs model safety and alignment efforts.
- Attribution patching errors stem primarily from downstream non-linearities rather than local curvature, enabling targeted correction strategies.
- Hessian-vector-product correction eliminates leading-order errors with only one additional backward pass, making second-order improvements feasible at scale.
- A reliability score enables practitioners to identify and flag untrustworthy attribution estimates before they lead to circuit misidentification.
- Multi-step HVP variants match or exceed Integrated Gradients accuracy while reducing computational cost significantly across model scales.
- The Screen-Flag-Fix workflow concentrates computational effort only on components flagged as unreliable, improving efficiency in circuit recovery.
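The Screen-Flag-Fix idea can be sketched as a simple loop. The summary does not define the reliability score or the fix step, so everything below is an assumption made for illustration: the score is taken to be the size of the second-order term relative to the first-order estimate, the threshold of 0.2 is arbitrary, and `fix` is a caller-supplied function standing in for whatever more expensive estimate is used. The component names and numbers are hypothetical.

```python
def screen_flag_fix(first_order, second_order, fix, threshold=0.2):
    """Screen all components cheaply, flag unreliable ones, fix only those.

    first_order:  dict mapping component name -> first-order estimate
    second_order: dict mapping component name -> second-order term
    fix:          callable(name) -> better estimate (assumed expensive)
    """
    results, flagged = {}, []
    for name, estimate in first_order.items():
        # Hypothetical reliability proxy: a large second-order term relative
        # to the first-order estimate suggests the linear estimate is poor.
        score = abs(second_order[name]) / (abs(estimate) + 1e-12)
        if score > threshold:
            flagged.append(name)
            results[name] = fix(name)   # spend extra compute only here
        else:
            results[name] = estimate    # keep the cheap estimate
    return results, flagged

# Hypothetical screening outputs for three components.
first_order = {"comp_a": 0.40, "comp_b": 0.59, "comp_c": -0.10}
second_order = {"comp_a": 0.01, "comp_b": -0.15, "comp_c": 0.002}

fix_calls = []

def fix(name):
    # Stand-in for the expensive step; here it just applies the correction.
    fix_calls.append(name)
    return first_order[name] + second_order[name]

results, flagged = screen_flag_fix(first_order, second_order, fix)
print("flagged:", flagged)
print("results:", results)
```

The design point is that the expensive step runs once per flagged component rather than once per component, which is where the efficiency described above would come from.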