Whisfusion: Parallel ASR Decoding with Masked Diffusion
Whisfusion introduces a masked diffusion decoder that transcribes speech faster than Whisper-large-v3 while matching or exceeding its accuracy across multilingual benchmarks. By replacing autoregressive decoding with parallel diffusion decoding, the system runs 4-5x faster and remains competitive with leading ASR systems, establishing non-autoregressive diffusion as a viable paradigm for high-throughput transcription.
Whisfusion addresses a fundamental trade-off in automatic speech recognition: autoregressive models like Whisper deliver high accuracy but generate transcripts one token at a time, creating latency that scales with output length. This limitation constrains real-time applications and high-throughput scenarios. The authors address it by combining a frozen Whisper-large-v3 encoder with a dedicated masked diffusion decoder, drawing on recent advances in diffusion-based text generation to maintain accuracy while enabling parallel processing.
The technical approach builds on masked diffusion language models, which have gained traction as alternatives to autoregressive generation across multiple domains. Rather than predicting tokens left-to-right, masked diffusion starts from a fully masked sequence and iteratively refines all positions simultaneously, reducing the sequential dependency bottleneck. Whisfusion's key innovations are high-mask specialization, which aligns training with the fully masked starting point of inference, and Parallel Diffusion Decoding for efficient multi-step denoising.
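The two ideas in the paragraph above can be illustrated with a minimal sketch. It is not Whisfusion's implementation: the `MASK` id, the `[0.7, 1.0]` mask-ratio range, the confidence-based commit schedule (borrowed from mask-predict style decoders), and the toy denoiser that stands in for the real decoder are all assumptions made for illustration.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token


def high_mask_corrupt(target, rng, low=0.7, high=1.0):
    """Training-side sketch: mask a large random fraction of the transcript.

    The [low, high] range is an assumption chosen for illustration; the idea is
    only that training inputs resemble the fully masked start of inference.
    """
    ratio = rng.uniform(low, high)
    n_mask = max(1, round(ratio * len(target)))
    masked_positions = set(rng.sample(range(len(target)), n_mask))
    return [MASK if i in masked_positions else tok for i, tok in enumerate(target)]


def parallel_decode(denoise, length, num_steps):
    """Inference-side sketch: start fully masked, then commit the most
    confident predictions in parallel at each step."""
    tokens = [MASK] * length
    for step in range(num_steps):
        masked = [i for i, tok in enumerate(tokens) if tok == MASK]
        if not masked:
            break
        preds = denoise(tokens)  # one (token, confidence) pair per position
        # Spread the remaining masked positions evenly over the remaining steps.
        n_commit = math.ceil(len(masked) / (num_steps - step))
        most_confident = sorted(masked, key=lambda i: preds[i][1], reverse=True)
        for i in most_confident[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


def make_toy_denoiser(target, rng):
    """Stand-in for the real decoder: predicts the reference token at every
    position with a random confidence, and counts how often it is called."""
    calls = {"n": 0}

    def denoise(tokens):
        calls["n"] += 1
        return [(tok, rng.random()) for tok in target]

    return denoise, calls


rng = random.Random(0)
reference = [11, 42, 7, 3, 99, 5, 18, 64, 2, 30, 8, 77]  # made-up token ids
denoise, calls = make_toy_denoiser(reference, rng)
decoded = parallel_decode(denoise, length=len(reference), num_steps=4)
corrupted = high_mask_corrupt(reference, rng)
```

In this toy run the 12-token transcript is recovered with 4 decoder passes, whereas a left-to-right decoder would need one pass per token. The number of passes is set by `num_steps` rather than by the transcript length, which is the property the parallel approach relies on.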
The empirical results point to practical value. Whisfusion surpasses Whisper-large-v3 on group-average accuracy across English, European, and CJK language benchmarks while achieving a 4-5x speedup. It also outperforms Whisper-turbo on both speed and accuracy, and competes with specialized systems like Canary and Qwen3-ASR at 3-7x higher throughput. These results matter because they shift the Pareto frontier for multilingual ASR, making high-accuracy transcription more economical at scale.
The broader significance lies in validating non-autoregressive paradigms for structured prediction tasks. As diffusion models mature across modalities, similar architectural patterns may benefit other sequence-to-sequence problems. For practitioners, this signals that trading sequential autoregressive modeling for parallel diffusion denoising can yield both speed and accuracy gains, particularly in latency-sensitive applications like real-time transcription and live streaming.
- Whisfusion achieves 4-5x faster inference than Whisper-large-v3 while improving group-average accuracy across multilingual benchmarks.
- Masked diffusion decoding processes entire transcripts in parallel, reducing the latency bottleneck inherent in left-to-right autoregressive models.
- The system is competitive with specialized ASR models like Canary and Qwen3-ASR at 3-7x higher throughput.
- High-mask specialization during training aligns the model with the fully masked inference starting point, improving efficiency and effectiveness.
- Open-source release of code and model weights accelerates adoption and enables further research into non-autoregressive speech recognition.