Do We Need Distinct Representations for Every Speech Token? Unveiling and Exploiting Redundancy in Large Speech Language Models
Researchers demonstrate that large speech language models carry significant redundancy in their token representations, particularly in deeper layers. By introducing Affinity Pooling, a training-free token merging technique, they achieve a 27.48% reduction in prefilling FLOPs and up to 1.7× memory savings while maintaining semantic accuracy, challenging the assumption that acoustic processing requires fully distinct tokens.
This research addresses a fundamental efficiency bottleneck in large speech language models: the computational overhead created by processing tokens at rates far exceeding semantic information density. The study's core contribution lies in empirically mapping where redundancy exists within model architectures, revealing that shallow layers preserve critical acoustic details while deep layers tolerate aggressive compression, a structured hierarchy that enables targeted optimization rather than uniform compression strategies.
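One way to see this layer-wise structure in practice is to measure how similar neighboring token representations become at each depth. The sketch below is illustrative only: the paper's exact redundancy metric is not reproduced here, and using mean cosine similarity between adjacent hidden states is an assumption.

```python
import numpy as np

def adjacent_token_similarity(hidden_states: np.ndarray) -> float:
    """Mean cosine similarity between consecutive token vectors.

    hidden_states: (num_tokens, hidden_dim) activations from one layer.
    Higher values indicate more redundant, and thus more mergeable, tokens.
    """
    a, b = hidden_states[:-1], hidden_states[1:]
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    cos = np.sum(a * b, axis=1) / np.maximum(norms, 1e-12)
    return float(cos.mean())

# Toy illustration (synthetic data, not real model activations):
# a "deep layer" whose tokens drift slowly scores higher than a
# "shallow layer" whose tokens are near-random and distinct.
rng = np.random.default_rng(0)
shallow = rng.normal(size=(50, 64))
deep = np.cumsum(rng.normal(scale=0.05, size=(50, 64)), axis=0) + 1.0
```

Running this metric per layer over real activations would reproduce the kind of depth profile the authors report: low adjacent similarity in shallow layers, high similarity in deep ones.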
The efficiency gains stem from understanding that speech processing inherently contains redundancy because acoustic signals encode information at high temporal resolution while semantic content changes more slowly. This mirrors similar findings in vision transformers and language models, where token pruning and merging techniques have proven effective. Affinity Pooling builds on this foundation by using similarity-based merging without requiring retraining, making it immediately applicable to existing models.
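The core idea of similarity-based, training-free merging can be sketched as follows. This is not the paper's Affinity Pooling implementation: the greedy left-to-right scan, the running-mean group representative, and the 0.95 threshold are all assumptions made for illustration.

```python
import numpy as np

def merge_by_affinity(tokens: np.ndarray, threshold: float = 0.95) -> np.ndarray:
    """Training-free token merging: group adjacent tokens whose cosine
    affinity to the current group's mean exceeds `threshold`, then
    replace each group with its mean vector.

    tokens: (num_tokens, hidden_dim); returns (num_groups, hidden_dim).
    """
    groups = [[tokens[0]]]
    for tok in tokens[1:]:
        rep = np.mean(groups[-1], axis=0)          # current group representative
        denom = np.linalg.norm(rep) * np.linalg.norm(tok)
        cos = float(rep @ tok) / max(denom, 1e-12)
        if cos >= threshold:
            groups[-1].append(tok)                 # redundant: merge into group
        else:
            groups.append([tok])                   # distinct: start a new group
    return np.stack([np.mean(g, axis=0) for g in groups])
```

Because merging is a pure function of the representations, the threshold becomes a deployment-time knob trading compression against fidelity, with no retraining required.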
For the AI infrastructure sector, these results have tangible implications. A 1.7× memory reduction and 1.1× faster time-to-first-token directly impact deployment costs and user experience for speech-based applications, and organizations running inference-heavy workloads could achieve significant operational savings. Confirmation in practical deployments suggests this is not merely a theoretical optimization: it delivers real-world efficiency improvements that could justify adoption in production systems.
Looking ahead, this work opens questions about optimal token rates for speech models and whether current architectures represent over-engineered solutions to semantic understanding tasks. Future research may explore task-specific compression profiles and whether these findings generalize across different speech domains and languages.
- Affinity Pooling reduces prefilling FLOPs by 27.48% without retraining or semantic loss
- Deep layers in speech models exhibit extreme redundancy compared to shallow layers, which encode acoustic details
- Memory usage drops by up to 1.7× and time-to-first-token improves by 1.1× on long utterances
- Training-free token merging mechanisms challenge assumptions about the necessary granularity of speech token processing
- Structured redundancy analysis reveals hierarchical compression opportunities across model depths