Cross-Attention is Half Explanation in Speech-to-Text Models
Researchers find that cross-attention in speech-to-text (S2T) models captures only about 50% of input relevance, contradicting the widespread assumption that attention scores reliably indicate which parts of the audio matter most. The study, which spans multiple model scales, shows that attention gives an incomplete view of the factors driving predictions.
This research challenges a fundamental assumption in deep learning interpretability: that attention weights reliably explain model behavior. The study systematically compared cross-attention scores against saliency maps derived from feature attribution methods across diverse speech-to-text architectures, revealing a significant gap between what attention visualizations suggest and what actually drives predictions.
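To make the comparison concrete, here is a minimal sketch of how head- and layer-averaged cross-attention might be scored against a saliency map. The array shapes, the synthetic data, the function names, and the choice of Pearson correlation as the agreement measure are all assumptions for illustration, not the study's confirmed procedure.

```python
import numpy as np

def aggregate_attention(attn):
    """Average cross-attention over layers and heads.

    attn: array of shape (layers, heads, target_tokens, source_frames).
    Returns an array of shape (target_tokens, source_frames).
    """
    return attn.mean(axis=(0, 1))

def pearson_per_token(attn_map, saliency_map):
    """Pearson correlation between attention and saliency, one value per target token."""
    a = attn_map - attn_map.mean(axis=1, keepdims=True)
    s = saliency_map - saliency_map.mean(axis=1, keepdims=True)
    denom = np.sqrt((a ** 2).sum(axis=1) * (s ** 2).sum(axis=1))
    # Constant rows have undefined correlation; report 0.0 for them.
    safe_denom = np.where(denom > 0, denom, 1.0)
    return np.where(denom > 0, (a * s).sum(axis=1) / safe_denom, 0.0)

# Synthetic stand-ins for real model outputs (hypothetical shapes and values).
rng = np.random.default_rng(0)
layers, heads, tokens, frames = 4, 8, 5, 50
attn = rng.random((layers, heads, tokens, frames))
attn /= attn.sum(axis=-1, keepdims=True)  # each row sums to 1, like a softmax output
saliency = rng.random((tokens, frames))

agg = aggregate_attention(attn)
corr = pearson_per_token(agg, saliency)
print(corr.shape)
```

With real data, `attn` would come from the decoder's cross-attention layers and `saliency` from a feature attribution method; a low or moderate correlation for a token would indicate that attention alone does not account for what the attribution method marks as relevant.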
The findings emerge from a growing tension in machine learning research. While attention mechanisms became ubiquitous partly due to their perceived interpretability—offering a window into model reasoning—empirical evidence increasingly suggests this interpretability is overstated. In the speech domain specifically, attention scores have been repurposed for downstream applications like timestamp estimation and audio-text alignment, with practitioners assuming these scores reliably reflect input relevance. This work exposes that assumption as problematic.
The implications extend beyond academic curiosity. Engineers and researchers building speech systems rely on attention visualizations for debugging, model validation, and feature engineering. If attention captures only half the explanatory picture, decisions based solely on these visualizations risk missing important model dynamics or introducing subtle biases. For developers of speech-to-text systems—whether for accessibility, transcription, or multilingual applications—this suggests investing in complementary interpretation methods rather than depending entirely on attention analysis.
Future work should explore why attention provides incomplete explanations and develop more robust interpretation frameworks that integrate multiple attribution methods. For practitioners, the takeaway is straightforward: treat attention as informative but insufficient, and validate assumptions about model behavior through additional analysis techniques.
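One simple additional check is occlusion: zero out a span of input frames and measure how much the model's score for a target token drops. The sketch below illustrates the idea on a toy scoring function. The names `occlusion_saliency` and `toy_score` are hypothetical, and occlusion is offered as one example of a complementary technique, not as the attribution method used in the study.

```python
import numpy as np

def occlusion_saliency(score_fn, features, window=1):
    """Score drop when each window of frames is zeroed out.

    score_fn: maps a (frames, dims) array to a scalar score for one target token.
    Returns an array of shape (frames,); larger values mean more relevant frames.
    """
    base = score_fn(features)
    saliency = np.zeros(features.shape[0])
    for start in range(0, features.shape[0], window):
        occluded = features.copy()
        occluded[start:start + window] = 0.0
        saliency[start:start + window] = base - score_fn(occluded)
    return saliency

# Toy stand-in for a model: the score depends only on frames 10 to 14.
def toy_score(features):
    return float(features[10:15].sum())

features = np.ones((30, 4))
sal = occlusion_saliency(toy_score, features)
print(np.nonzero(sal)[0])
```

In practice `score_fn` would wrap a forward pass of the speech-to-text model, and the resulting saliency vector could be compared against the attention weights for the same token to see where the two disagree.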
- →Cross-attention in S2T models captures only about 50% of input relevance and accounts for just 52-75% of saliency-based explanations
- →Attention scores moderately align with saliency maps when aggregated across heads and layers, but individual layer analysis reveals limitations
- →Current practice of using attention weights for downstream tasks like timestamp estimation may miss important model dynamics
- →The study spans monolingual, multilingual, single-task, and multi-task models at multiple scales, establishing broad applicability
- →Researchers should develop complementary interpretation methods beyond attention visualization for more complete model understanding