How Do Instructions Shape Speech? Cross-Attention Attribution for Style-Captioned Text-to-Speech
Researchers propose a cross-attention attribution method for style-captioned text-to-speech systems, adapting the DAAM framework to speech diffusion models for the first time. Analysis of 3,600 style-caption and text combinations shows how individual caption words influence the acoustic output: style tokens condition voice characteristics globally, and attention to them peaks in early generation steps and in deep network layers.
This research addresses a fundamental interpretability challenge in expressive text-to-speech systems: understanding how natural language instructions translate into acoustic modifications. The work bridges human-computer interaction and machine learning by systematically mapping which caption tokens drive changes in fundamental frequency, energy, and other voice characteristics across the generation pipeline.
Attribution methods for speech diffusion models extend interpretability research beyond the vision and language domains. Prior work on cross-attention visualization (DAAM) has proven valuable for understanding image generation, but applying these techniques to speech required adapting them to the temporal structure of audio and to diffusion-specific dynamics. Covering 25 transformer layers and 24 ODE integration steps across thousands of examples, the analysis offers broad empirical evidence about where and when style conditioning operates.
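The aggregation idea behind a DAAM-style analysis can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes cross-attention weights have already been extracted into a nested list `attn[step][layer][frame][token]` with heads averaged, and it uses a plain mean over steps and layers. The function names and the toy values are hypothetical.

```python
def aggregate_attention(attn):
    """Average attn[step][layer][frame][token] over steps and layers.

    Returns heat[frame][token]. Heads are assumed to be averaged already.
    """
    n_steps, n_layers = len(attn), len(attn[0])
    n_frames, n_tokens = len(attn[0][0]), len(attn[0][0][0])
    heat = [[0.0] * n_tokens for _ in range(n_frames)]
    for step in attn:
        for layer in step:
            for f, row in enumerate(layer):
                for t, weight in enumerate(row):
                    heat[f][t] += weight
    scale = 1.0 / (n_steps * n_layers)
    return [[value * scale for value in row] for row in heat]


def token_mass_per_step(attn, token):
    """Mean attention weight on one caption token at each generation step."""
    profile = []
    for step in attn:
        weights = [row[token] for layer in step for row in layer]
        profile.append(sum(weights) / len(weights))
    return profile


# Toy input (made up): 2 steps x 2 layers x 2 audio frames x 2 caption tokens.
toy_attn = [
    [[[0.9, 0.1], [0.8, 0.2]], [[0.7, 0.3], [0.6, 0.4]]],
    [[[0.5, 0.5], [0.4, 0.6]], [[0.3, 0.7], [0.2, 0.8]]],
]
heat = aggregate_attention(toy_attn)
profile = token_mass_per_step(toy_attn, token=0)
```

Here `heat` is the per-frame, per-token map that would be visualized, and `profile` shows how much attention a token receives at each step. In this toy input the first token receives more attention at the first step than at the second, which is the kind of early-step concentration the study reports for style tokens.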
For practitioners building controllable TTS systems, these findings offer actionable insight into system behavior. Style attention concentrates in early diffusion steps and deeper layers, which suggests where architectural interventions could improve controllability or efficiency. Attention entropy reaches its minimum at layer 17, suggesting that this stage acts as a critical decision point, potentially a compression bottleneck where style semantics crystallize into acoustic parameters.
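To make the entropy measure concrete, the sketch below computes the Shannon entropy of each audio frame's attention distribution over caption tokens, averages it per layer, and picks the layer with the lowest value. It assumes a nested list `layers[layer][frame][token]` of non-negative weights; the exact normalization and averaging used in the study are not given here, so the function names and toy numbers are illustrative only.

```python
import math


def attention_entropy(row):
    """Shannon entropy (in nats) of one frame's attention over caption tokens."""
    total = sum(row)
    if total <= 0.0:
        raise ValueError("attention row must have positive mass")
    probs = [w / total for w in row]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy_per_layer(layers):
    """layers[layer][frame][token] -> mean entropy of each layer."""
    return [
        sum(attention_entropy(row) for row in layer) / len(layer)
        for layer in layers
    ]


def most_selective_layer(layers):
    """Index of the layer with the lowest mean attention entropy."""
    entropies = mean_entropy_per_layer(layers)
    return min(range(len(entropies)), key=entropies.__getitem__)


# Toy input (made up): 3 layers x 2 frames x 4 caption tokens.
toy_layers = [
    [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]],  # diffuse
    [[0.97, 0.01, 0.01, 0.01], [0.97, 0.01, 0.01, 0.01]],  # sharply peaked
    [[0.40, 0.30, 0.20, 0.10], [0.40, 0.30, 0.20, 0.10]],  # in between
]
selective = most_selective_layer(toy_layers)
```

Low entropy means attention is concentrated on a few caption tokens, so the layer returned by `most_selective_layer` is the one where the model is choosiest about which words it attends to.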
These insights have practical implications for debugging failure modes in style control and designing more interpretable architectures. As expressive TTS systems become more prevalent in applications requiring precise voice control, understanding their internal attention patterns enables better system design and user experience. Future work may extend these methods to other modalities or improve real-time controllability based on these attribution patterns.
- Style tokens exhibit lower temporal variance than content tokens, consistent with style being applied globally across the utterance
- Style attention correlates with acoustic features such as F0 and energy, which supports the attribution method's relevance
- Style conditioning peaks during early diffusion steps and in deep network layers, pointing to candidate intervention points
- Minimum attention entropy at layer 17 marks the network's most selective processing stage
- The first systematic study of cross-attention attribution in speech diffusion models, opening new directions for interpretability research
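Two of the measurements in the list above, temporal variance and correlation with acoustic features, can be sketched as follows. This is a hedged illustration rather than the study's pipeline: it assumes a heat map `heat[frame][token]` of aggregated attention and a per-utterance acoustic measurement such as mean F0, and it uses population variance and Pearson correlation as stand-ins for whatever statistics the authors applied. All values are made up.

```python
import math
import statistics


def temporal_variance(heat, token):
    """Variance across audio frames of the attention on one caption token."""
    return statistics.pvariance([row[token] for row in heat])


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy heat map: 4 frames x 2 tokens (token 0 = style word, token 1 = content word).
toy_heat = [[0.5, 0.1], [0.5, 0.9], [0.5, 0.2], [0.5, 0.8]]
style_var = temporal_variance(toy_heat, token=0)
content_var = temporal_variance(toy_heat, token=1)

# Hypothetical per-utterance values: attention on a pitch-related style word
# and the mean F0 (Hz) measured on the generated audio.
style_attention = [0.10, 0.20, 0.30, 0.40]
mean_f0_hz = [110.0, 130.0, 150.0, 170.0]
r = pearson(style_attention, mean_f0_hz)
```

A style token that conditions the whole utterance should show a flat attention profile over time (low `style_var`), while a content token's attention moves with the frames where that word is spoken (higher `content_var`).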