arXiv · cs.AI · 14h ago
Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?
Researchers introduce Pando, a benchmark that evaluates mechanistic interpretability methods while controlling for the 'elicitation confounder': the possibility that black-box prompting alone could explain model behavior, making white-box tools unnecessary. Testing 720 models, they find that gradient-based attribution and relevance patching improve auditing accuracy by 3-5% when model explanations are absent or misleading, but underperform when models provide faithful explanations, suggesting interpretability tools may add limited value for alignment auditing.
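The control is easiest to see as a split comparison: measure the white-box accuracy gain over a prompting-only baseline separately on models whose self-explanations are faithful versus absent or misleading. A minimal sketch of that comparison, where all names and the toy data are illustrative and not from the paper:

```python
# Illustrative sketch of the 'elicitation confounder' control (hypothetical
# names; not the Pando codebase): compare a white-box auditing verdict
# against a black-box prompting-only verdict on the same models, split by
# whether each model's own explanation is faithful.

from dataclasses import dataclass
from statistics import mean


@dataclass
class Case:
    model_id: str
    explanation_faithful: bool  # does the model explain itself honestly?
    blackbox_correct: bool      # prompting-only audit verdict was right
    whitebox_correct: bool      # e.g. attribution/patching verdict was right


def accuracy_delta(cases: list[Case]) -> float:
    """White-box accuracy minus black-box accuracy on the same cases."""
    bb = mean(c.blackbox_correct for c in cases)
    wb = mean(c.whitebox_correct for c in cases)
    return wb - bb


# Toy stand-in for the benchmark's 720 models.
cases = [
    Case("m1", True, True, True),
    Case("m2", True, True, False),
    Case("m3", False, False, True),
    Case("m4", False, False, True),
]

faithful = [c for c in cases if c.explanation_faithful]
unfaithful = [c for c in cases if not c.explanation_faithful]

print("delta acc (faithful explanations):        ", accuracy_delta(faithful))
print("delta acc (absent/misleading explanations):", accuracy_delta(unfaithful))
```

In this framing, the paper's headline result corresponds to a positive delta on the unfaithful split and a flat or negative delta on the faithful one.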