
Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

arXiv – CS AI | Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland

AI Summary

Researchers developed Whisper-LLaDA, a diffusion-based large language model for automatic speech recognition (ASR) that achieves a 12.3% relative word error rate improvement over a Whisper-LLaMA baseline. The study shows that audio-conditioned embeddings are crucial to these gains: plain-text deliberation without acoustic features fails to improve accuracy.
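The "relative improvement" figure compares two word error rates, not absolute percentage points. A minimal sketch of the calculation; the baseline WER used in the example is made up for illustration and is not taken from the paper:

```python
def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fractional reduction in word error rate relative to the baseline."""
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers: a baseline of 5.63% WER reduced to 4.94% WER
# corresponds to roughly a 12.3% relative improvement.
print(round(relative_improvement(5.63, 4.94), 3))  # 0.123
```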

Key Takeaways
  • Whisper-LLaDA achieved 2.25%/4.94% WER on LibriSpeech test sets, showing 12.3% relative improvement over Whisper-LLaMA baseline.
  • Audio-conditioned embeddings are essential for performance gains, as plain-text LLaDA without acoustic features failed to improve accuracy.
  • The diffusion-based model offers faster inference than baseline systems in most experimental configurations.
  • Random masking, low-confidence masking, and semi-autoregressive strategies were explored for deliberation-based processing.
  • Code and model are open-sourced, enabling further research and development in diffusion-based ASR systems.
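Of the deliberation strategies listed above, low-confidence masking re-masks the tokens of a first-pass hypothesis that the model is least sure about, so the diffusion model can re-predict only those positions. A rough sketch of that idea, with illustrative function names, tokens, and confidence scores (not the paper's implementation):

```python
import math

MASK = "<mask>"  # placeholder token re-predicted by the diffusion model

def low_confidence_mask(tokens, confidences, mask_ratio=0.25):
    """Replace the lowest-confidence fraction of tokens with MASK."""
    k = max(1, math.ceil(len(tokens) * mask_ratio))
    # Indices of the k least-confident tokens in the hypothesis.
    worst = set(sorted(range(len(tokens)), key=lambda i: confidences[i])[:k])
    return [MASK if i in worst else t for i, t in enumerate(tokens)]

# First-pass hypothesis with per-token confidences (illustrative values):
hyp = ["the", "cat", "sat", "on", "the", "mat"]
conf = [0.99, 0.41, 0.95, 0.97, 0.98, 0.52]
print(low_confidence_mask(hyp, conf))
# ['the', '<mask>', 'sat', 'on', 'the', '<mask>']
```

Random masking differs only in how the indices are chosen (uniformly at random instead of by confidence), while the semi-autoregressive variant applies such updates block by block from left to right.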