AI · Bullish · Importance: 7/10
Whisper-MLA: Reducing GPU Memory Consumption of ASR Models based on MHA2MLA Conversion
arXiv · CS AI | Sen Zhang, Jianguo Wei, Wenhuan Lu, Xianghu Yue, Wei Li, Qiang Li, Pengcheng Zhao, Ming Cai, Luo Si
AI Summary
Researchers introduce Whisper-MLA, a modified version of OpenAI's Whisper speech recognition model that uses Multi-Head Latent Attention (MLA) to shrink the attention KV cache, and with it GPU memory consumption, by up to 87.5% while maintaining accuracy. The approach addresses a key scalability issue that limits transformer-based ASR models on long-form audio.
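To make the mechanism concrete, here is a minimal, self-contained sketch of latent KV caching in the spirit of MLA. It is written in PyTorch with hypothetical dimensions and layer names (`kv_down`, `k_up`, `v_up` are illustrative, not the paper's implementation), and it omits details such as causal masking and rotary-embedding handling:

```python
# Minimal sketch of latent KV caching (MLA-style); dimensions and
# names are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Self-attention that caches a low-rank latent instead of full K/V.

    Standard MHA caches K and V (2 * d_model floats per token per layer);
    here we cache one latent of size d_latent per token and reconstruct
    K/V from it on the fly, shrinking the cache by d_latent / (2 * d_model).
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_latent: int = 128):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Down-projection: compress each token into a shared latent.
        self.kv_down = nn.Linear(d_model, d_latent)
        # Up-projections: reconstruct per-head K and V from the latent.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        b, t, d = x.shape
        latent = self.kv_down(x)                      # (b, t, d_latent)
        if latent_cache is not None:                  # extend the running cache
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        # Causal mask omitted: during step-by-step decoding (t == 1) the
        # current query may attend to every cached position anyway.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, d)
        return self.out(y), latent                    # cache the latent, not K/V
```

During decoding, the returned `latent` is fed back as `latent_cache` on the next step, so the per-token cache cost is d_latent floats instead of 2 * d_model for separate K and V, trading a small amount of recompute in the up-projections for a much smaller cache.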
Key Takeaways
- Whisper-MLA reduces KV cache size by up to 87.5% compared to standard Whisper while maintaining competitive accuracy (see the arithmetic sketch after this list)
- The approach applies Multi-Head Latent Attention specifically to decoder self-attention, striking the best balance between performance and memory
- Existing pretrained Whisper models can be converted to Whisper-MLA with minimal fine-tuning
- The conversion addresses the significant GPU memory consumption that limits Whisper's use on long-form audio
- Extensive testing on the LibriSpeech benchmark validates the MHA-to-MLA conversion approach
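For intuition on where a figure like 87.5% can come from, consider the cache-size arithmetic under assumed, not paper-reported, dimensions:

```python
# Illustrative KV-cache arithmetic; d_model and d_latent are assumed
# values, not dimensions reported in the paper.
d_model, d_latent = 512, 128     # hypothetical model and latent widths
mha_floats = 2 * d_model         # MHA caches K and V per token per layer
mla_floats = d_latent            # MLA caches one shared latent per token
print(f"reduction: {1 - mla_floats / mha_floats:.1%}")  # -> 87.5%
```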
#whisper #speech-recognition #gpu-optimization #transformer #memory-efficiency #asr #multi-head-attention #ai-models #machine-learning #performance-optimization