WAND: Windowed Attention and Knowledge Distillation for Efficient Autoregressive Text-to-Speech Models
Researchers introduce WAND, a framework that reduces the computational and memory costs of autoregressive text-to-speech (AR-TTS) models by replacing full self-attention with windowed attention combined with knowledge distillation. The approach achieves up to 66.2% KV cache memory reduction while maintaining speech quality, addressing a critical scalability bottleneck in modern AR-TTS systems.
WAND tackles a fundamental efficiency problem in large language model-based text-to-speech systems. Autoregressive TTS models have achieved impressive quality but suffer from quadratic scaling in memory and compute as sequences grow longer. This technical limitation creates practical barriers to deployment in resource-constrained environments like edge devices and mobile applications. The framework's dual-attention architecture—separating global attention over conditioning tokens from local sliding-window attention over generated tokens—represents a pragmatic engineering solution that maintains the model's ability to access long-range context while limiting computational overhead.
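The dual-attention pattern described above can be sketched as an attention mask: every query position attends to the conditioning prefix (global), while generated tokens additionally attend only to a fixed window of recent generated tokens (local). This is an illustrative sketch, not the paper's implementation; the function name, shapes, and the exact treatment of the conditioning prefix are assumptions.

```python
import numpy as np

def wand_style_mask(n_cond, n_gen, window):
    """Illustrative causal attention mask in the spirit of WAND's dual
    attention: all positions may attend (causally) to the conditioning
    prefix, while generated tokens attend to at most `window` recent
    generated tokens. Names and shapes are assumptions for illustration."""
    n = n_cond + n_gen
    mask = np.zeros((n, n), dtype=bool)  # True = attention allowed
    for q in range(n):
        # Global: attend to all conditioning tokens at or before q.
        mask[q, :min(q + 1, n_cond)] = True
        if q >= n_cond:
            # Local: sliding window over generated tokens, including q.
            start = max(n_cond, q - window + 1)
            mask[q, start:q + 1] = True
    return mask

mask = wand_style_mask(n_cond=4, n_gen=8, window=3)
# The last generated token attends to 4 conditioning tokens
# plus a window of 3 generated tokens.
print(int(mask.sum(axis=1)[-1]))  # → 7
```

Because each generated row attends to a bounded set of keys, per-step attention cost stops growing with sequence length, which is the source of the near-constant per-step latency the summary describes.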
The research builds on established efficiency techniques in transformer architectures. Windowed attention patterns have proven effective in language models and other domains, but their application to specialized TTS models requires careful tuning. WAND's curriculum learning strategy addresses the training stability challenges that typically emerge when constraining attention patterns in pretrained models. By progressively tightening the attention window during fine-tuning and leveraging knowledge distillation from full-attention teachers, the authors preserve synthesis quality while reducing resource consumption.
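The two training ingredients can be sketched concretely: a curriculum that gradually shrinks the attention window during fine-tuning, and a standard temperature-scaled KL distillation loss between the windowed student and the full-attention teacher. The linear schedule, its endpoints, and the function names are assumptions for illustration; the paper's exact schedule and loss weighting may differ.

```python
import numpy as np

def window_schedule(step, total_steps, start=1024, end=128):
    """Hypothetical linear curriculum: the attention window shrinks
    from `start` to `end` over fine-tuning (illustrative assumption)."""
    frac = min(step / total_steps, 1.0)
    return int(round(start + (end - start) * frac))

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard KL-based knowledge distillation between a windowed
    student and a full-attention teacher over discrete speech tokens."""
    t = temperature

    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(teacher_logits / t)              # teacher distribution
    log_q = np.log(softmax(student_logits / t))  # student log-probs
    # KL(p || q), averaged over positions, rescaled by t^2 as usual.
    return float((p * (np.log(p) - log_q)).sum(axis=-1).mean() * t * t)
```

The curriculum lets the pretrained model adapt gradually instead of being forced into a tight window at once, while the distillation term pulls the student's token distribution back toward the teacher's full-context behavior.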
The practical impact extends beyond academic optimization. These efficiency gains directly affect deployment feasibility for AR-TTS systems in production environments where latency and memory constraints matter. Near-constant per-step latency enables more responsive real-time speech synthesis applications. The validation across three modern AR-TTS models suggests broad applicability rather than narrow optimization for specific architectures. For developers building speech synthesis features, particularly in mobile, IoT, or latency-sensitive applications, this framework reduces infrastructure costs and enables previously infeasible use cases.
- WAND reduces KV cache memory consumption by up to 66.2% through windowed attention while maintaining speech quality.
- The dual-attention mechanism separates global context access from local token generation, achieving near-constant per-step latency.
- Curriculum learning during fine-tuning progressively constrains attention windows, stabilizing adaptation of pretrained models.
- Knowledge distillation from full-attention teachers enables efficient models to recover high-fidelity synthesis quality.
- The framework demonstrates broad compatibility, improving efficiency across three different modern autoregressive TTS architectures.
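The KV cache saving can be illustrated with back-of-the-envelope arithmetic: with full attention the cache grows with sequence length, while with a sliding window the cached generated tokens are capped at the window size (conditioning tokens stay cached for global access). All model dimensions below are illustrative assumptions, not the configurations evaluated in the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_heads, head_dim,
                   window=None, n_cond=0, bytes_per_elem=2):
    """Back-of-the-envelope KV cache size: keys and values (factor 2),
    fp16 by default. With a sliding window, cached generated tokens are
    capped at `window`; conditioning tokens remain cached for global
    attention. Parameter names and values are illustrative assumptions."""
    if window is None:
        cached = seq_len  # full attention caches every position
    else:
        cached = n_cond + min(seq_len - n_cond, window)
    return 2 * n_layers * n_heads * head_dim * cached * bytes_per_elem

# Hypothetical model: 24 layers, 16 heads of dim 64, 4096-token sequence.
full = kv_cache_bytes(4096, 24, 16, 64)
windowed = kv_cache_bytes(4096, 24, 16, 64, window=1024, n_cond=256)
print(round(1 - windowed / full, 4))  # → 0.6875 (fraction of cache saved)
```

Under these assumed dimensions the windowed cache holds 1280 of 4096 positions, a roughly two-thirds saving; the actual reduction depends on sequence length, window size, and the share of conditioning tokens, which is consistent with the "up to 66.2%" figure being configuration-dependent.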