Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?
Researchers introduce Pando, a benchmark that evaluates mechanistic interpretability methods while controlling for the 'elicitation confounder': the possibility that black-box prompting alone explains model behavior without any white-box tools. Testing 720 models, they find that gradient-based attribution and relevance patching improve accuracy by 3-5 percentage points when explanations are absent or misleading, but offer no advantage when models provide faithful explanations, suggesting interpretability tools may add limited value for alignment auditing.
Pando addresses a critical methodological gap in mechanistic interpretability research that has significant implications for AI safety and alignment verification. The core insight is that previous evaluations conflate two distinct capabilities: the ability to extract information through prompting versus the ability to recover internal decision mechanisms. By systematically varying whether models provide faithful, absent, or misleading explanations, the researchers isolate what interpretability tools actually contribute beyond simple behavioral elicitation.
The benchmark's findings are counterintuitive and sobering for the interpretability community. Gradient-based attribution methods show modest improvements (3-5 percentage points) only when explanations are unavailable or untrustworthy. When models provide genuine explanations, however, black-box prompting entirely eliminates the advantage of white-box methods. This suggests that in realistic scenarios where models are partially cooperative, interpretability techniques may offer limited additional insight. Tools such as the logit lens, sparse autoencoders, and circuit tracing provide no reliable benefit across conditions, calling into question their utility for alignment auditing.
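As a concrete illustration of the gradient-based family, input-times-gradient attribution can be sketched on a toy linear classifier. This is a minimal sketch under assumed details (the paper's exact attribution variant and model setup are not specified here); for a linear model the gradient of a logit with respect to the input is just the corresponding weight row:

```python
import numpy as np

# Hypothetical toy model: a linear classifier with 3 classes, 5 features.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # class weight matrix
x = rng.normal(size=5)        # one input example

logits = W @ x
target = int(np.argmax(logits))  # attribute the predicted class's logit

# For logit_c = W[c] @ x, the gradient d(logit_c)/dx is exactly W[c].
# Input-times-gradient multiplies that gradient elementwise by the input.
grad = W[target]
attribution = x * grad

# Positive entries pushed the model toward its prediction; negative
# entries pushed against it.
print("predicted class:", target)
print("attribution:", attribution.round(3))
```

For deep networks the gradient would come from autodiff rather than a closed form, but the attribution step (elementwise input-gradient product) is the same.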
A variance decomposition analysis explains these disparities: gradients track the model's actual decision computation, while other readouts largely reflect task representation and field-identity biases. This distinction between tracking genuine decision processes and capturing statistical artifacts is crucial for understanding when interpretability tools have real value. For the AI safety community, these results underscore that interpretability gains depend heavily on specific model conditions and explanation fidelity. The release of all models and evaluation infrastructure enables reproducible research and broader validation of these findings.
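The variance-decomposition idea can be sketched as a one-way, ANOVA-style split of accuracy variance across explanation conditions. The numbers below are purely illustrative, not the paper's data, and the three condition labels are taken from the summary above:

```python
import numpy as np

# Illustrative per-model accuracies under three explanation conditions.
rng = np.random.default_rng(1)
scores = {
    "faithful":   rng.normal(0.80, 0.02, 40),
    "absent":     rng.normal(0.74, 0.02, 40),
    "misleading": rng.normal(0.72, 0.02, 40),
}
all_scores = np.concatenate(list(scores.values()))
grand_mean = all_scores.mean()

# Between-condition sum of squares vs. total sum of squares:
# their ratio is the share of variance explained by the condition factor.
ss_between = sum(len(v) * (v.mean() - grand_mean) ** 2 for v in scores.values())
ss_total = ((all_scores - grand_mean) ** 2).sum()
explained = ss_between / ss_total
print(f"variance explained by explanation condition: {explained:.2f}")
```

The same decomposition, applied per readout, is what lets one say that some signal tracks the decision computation (high variance explained by the decision factor) while another mostly tracks task or field-identity factors.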
- Mechanistic interpretability tools provide measurable benefits only when model explanations are absent or misleading, not when they are faithful.
- Black-box prompting matches or exceeds white-box interpretability methods when models cooperatively provide accurate explanations.
- Gradient-based attribution and relevance patching show the most promise, while popular tools like sparse autoencoders and circuit tracing provide unreliable results.
- The 'elicitation confounder' has likely inflated reported gains in prior interpretability research by failing to control for behavioral prompting.
- Pando's open-source benchmark enables standardized evaluation of interpretability methods across controlled experimental conditions.