Faithfulness Evaluation for Decoder-only LLM Attributions with Controlled Retained Information
Researchers propose π-Soft-NC and π-Soft-NS, improved evaluation metrics for assessing input attribution methods in large language models that control for the number of retained words, addressing a fundamental bias in existing faithfulness evaluation frameworks. They also introduce Grad-ELLM, a gradient-based attribution method designed for decoder-only LLMs that combines gradient and attention mechanisms for stronger explanatory performance.
This research addresses a gap in how AI explainability methods are evaluated. Current soft-perturbation metrics such as Soft-NC and Soft-NS inadvertently conflate attribution quality with the amount of input retained, so methods that keep more tokens can appear superior regardless of actual explanation quality. The result is a misleading comparison in which an apparently better-performing attribution method may simply be keeping more words rather than identifying the truly important inputs.
The problem emerges from how attribution methods are benchmarked. When evaluating which input tokens most influence model outputs, existing metrics do not control for how much of the input each method retains: a method retaining 80% of tokens will naturally score higher than one retaining 20%, even if both identify equally important information. This methodological flaw has likely skewed progress in explainable AI for language models, as developers optimize for the metric rather than for genuine faithfulness.
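The retention gap can be illustrated with a minimal Python sketch. It assumes, for illustration only, that soft perturbation keeps each token independently with a probability equal to its attribution score. The scores and the helpers `expected_retention` and `sample_mask` are hypothetical and are not taken from the paper.

```python
import random

def expected_retention(keep_probs):
    """Average keep-probability, i.e. the expected fraction of tokens retained."""
    return sum(keep_probs) / len(keep_probs)

def sample_mask(keep_probs, rng):
    """Draw one soft-perturbation mask: 1 keeps the token, 0 drops it."""
    return [1 if rng.random() < p else 0 for p in keep_probs]

# Hypothetical scores from two attribution methods with the same token ranking.
method_a = [0.95, 0.90, 0.80, 0.75, 0.60]
method_b = [0.25 * p for p in method_a]  # same ordering, smaller scale

print(round(expected_retention(method_a), 2))  # expected fraction kept by method A
print(round(expected_retention(method_b), 2))  # expected fraction kept by method B

# One random mask per method (assumed Bernoulli sampling per token).
rng = random.Random(0)
print(sample_mask(method_a, rng))
print(sample_mask(method_b, rng))
```

Both hypothetical methods rank the tokens identically, yet the first keeps about 80% of them in expectation and the second about 20%, so a metric computed on the perturbed inputs would reward the first for its score scale rather than for its ranking.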
The proposed π-Soft-NC and π-Soft-NS framework standardizes expected token retention across the methods being compared, creating an apples-to-apples evaluation. Grad-ELLM's innovation lies in combining two complementary signals (gradient-derived importance, which captures numerical sensitivity, and attention-derived importance, which captures where the model focuses) to produce richer attributions tailored to autoregressive generation.
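One way to picture standardized retention is the Python sketch below. It is not the paper's formulation: it assumes keep-probabilities are obtained by scaling non-negative attribution scores by a constant, clipping at 1, and choosing the constant by bisection so that the mean keep-probability equals a target π. The function `calibrate_retention` and the example scores are hypothetical.

```python
def calibrate_retention(scores, target):
    """Scale scores into keep-probabilities whose mean is `target` (assumed scheme)."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must lie in [0, 1]")
    n = len(scores)
    pos = [max(s, 0.0) for s in scores]  # treat negative scores as zero importance
    if sum(pos) == 0.0:
        return [target] * n  # no signal: keep every token with equal probability

    def mean_keep(c):
        return sum(min(1.0, c * s) for s in pos) / n

    # mean_keep is non-decreasing in c, so bracket the target and bisect.
    lo, hi = 0.0, 1.0
    while mean_keep(hi) < target and hi < 1e12:
        hi *= 2.0  # the cap guards against targets unreachable due to zero scores
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if mean_keep(mid) < target:
            lo = mid
        else:
            hi = mid
    return [min(1.0, hi * s) for s in pos]

# Hypothetical raw scores on very different scales.
method_a = [0.9, 0.8, 0.7, 0.6]
method_b = [0.2, 0.1, 0.05, 0.05]

for name, scores in (("A", method_a), ("B", method_b)):
    probs = calibrate_retention(scores, target=0.5)
    print(name, [round(p, 3) for p in probs], round(sum(probs) / len(probs), 3))
```

After calibration both hypothetical methods retain half of the tokens in expectation, so any remaining difference in a faithfulness score reflects which tokens each method favors rather than how many it keeps.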
Industry impact extends across multiple stakeholders. For AI safety researchers and regulators, better attribution methods improve model interpretability and trustworthiness assessment. For LLM developers, more rigorous evaluation frameworks accelerate progress toward genuinely explainable models. The findings suggest previous attribution research may require re-evaluation under these corrected metrics, potentially reshaping the field's technical priorities and redirecting development resources toward methods with authentic explanatory power rather than metric gaming.
- Existing faithfulness metrics conflate attribution quality with token retention, inflating scores for methods that keep more words
- π-Soft-NC and π-Soft-NS standardize expected retention probability across attribution methods for rigorous comparison
- Grad-ELLM combines gradient and attention mechanisms to generate stronger explanations for decoder-only LLMs
- Corrected evaluation framework may require reassessment of previously published attribution research
- Improved XAI metrics are foundational for building trustworthy and interpretable large language models