Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

arXiv – CS AI | Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland | March 2, 2026 at 05:00 AM

AI Summary

Researchers developed Whisper-LLaDA, a diffusion-based large language model for automatic speech recognition (ASR) that achieves a 12.3% relative word error rate (WER) improvement over the Whisper-LLaMA baseline. The study shows that audio-conditioned embeddings are crucial to the accuracy gains: a plain-text LLaDA without acoustic features fails to improve accuracy.
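
For readers unfamiliar with the metric, the sketch below shows how a word error rate and a relative improvement are conventionally computed. It is a minimal illustration under the standard definitions (word-level edit distance divided by reference length, and the fractional reduction relative to the baseline); the function names and the toy sentences are hypothetical and are not taken from the paper or its released code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fractional WER reduction relative to the baseline (0.2 means 20%)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Toy sentences, not data from the paper: one substitution, one deletion.
    wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
    print(f"toy WER: {wer:.3f}")
    # Toy numbers, not the paper's: a drop from 10.0% to 8.0% WER.
    print(f"relative improvement: {relative_improvement(10.0, 8.0):.1%}")
```

Because the improvement is relative, a 12.3% gain means the new WER is 12.3% lower than the baseline's WER, not 12.3 percentage points lower.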

Key Takeaways

→Whisper-LLaDA achieved 2.25%/4.94% WER on the LibriSpeech test sets, a 12.3% relative improvement over the Whisper-LLaMA baseline.
→Audio-conditioned embeddings are essential for performance gains, as plain-text LLaDA without acoustic features failed to improve accuracy.
→The diffusion-based model offers faster inference than the baseline in most experimental configurations.
→Random masking, low-confidence masking, and semi-autoregressive strategies were explored for deliberation-based processing.
→Code and model are open-sourced, enabling further research and development in diffusion-based ASR systems.
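
The three masking strategies named in the takeaways can be pictured with a small sketch. The summary gives only their names, so the definitions below are assumptions based on those names rather than the paper's actual procedure: the mask symbol, the threshold, the masking ratio, and the block size are all hypothetical, and the denoising model that would fill the masked positions is omitted.

```python
import random
from typing import List, Sequence, Tuple

MASK = "[MASK]"  # hypothetical mask symbol, not the model's real token


def random_mask(tokens: Sequence[str], ratio: float, rng: random.Random) -> List[str]:
    """Mask a randomly chosen fraction of positions (assumed reading of 'random masking')."""
    n_masked = round(ratio * len(tokens))
    positions = set(rng.sample(range(len(tokens)), n_masked))
    return [MASK if i in positions else tok for i, tok in enumerate(tokens)]


def low_confidence_mask(
    tokens: Sequence[str], confidences: Sequence[float], threshold: float
) -> List[str]:
    """Mask tokens whose first-pass confidence falls below a threshold (assumed reading)."""
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    return [MASK if conf < threshold else tok for tok, conf in zip(tokens, confidences)]


def semi_autoregressive_blocks(n_tokens: int, block_size: int) -> List[Tuple[int, int]]:
    """Split positions into left-to-right blocks; each block would be denoised in turn."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [
        (start, min(start + block_size, n_tokens))
        for start in range(0, n_tokens, block_size)
    ]


if __name__ == "__main__":
    first_pass = ["the", "cat", "sad", "on", "the", "mat"]
    confidences = [0.98, 0.95, 0.41, 0.90, 0.97, 0.93]  # made-up scores
    print(low_confidence_mask(first_pass, confidences, threshold=0.5))
    print(random_mask(first_pass, ratio=0.5, rng=random.Random(0)))
    print(semi_autoregressive_blocks(len(first_pass), block_size=4))
```

In a deliberation setup of this kind, the masked transcript would be passed back to the diffusion model together with the audio-conditioned embeddings so that the masked positions can be re-predicted; that step depends on the released model and is not sketched here.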