🧠 AI🟒 BullishImportance 6/10

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

arXiv – CS AI | Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland
AI Summary

Researchers developed Whisper-LLaDA, a diffusion-based large language model for automatic speech recognition (ASR) that achieves a 12.3% relative improvement in word error rate (WER) over the Whisper-LLaMA baseline. The study shows that audio-conditioned embeddings are crucial to this gain: a plain-text LLaDA without acoustic features fails to improve accuracy.
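For readers unfamiliar with the metric, a relative improvement in word error rate (WER) is the fractional reduction in WER measured against a baseline. The following is a minimal sketch of that arithmetic; the function name and the example numbers are illustrative assumptions, not figures from the paper.

```python
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    """Return the fractional WER reduction relative to the baseline (0.10 means 10%)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers chosen only to illustrate the formula.
example = relative_wer_improvement(baseline_wer=10.0, new_wer=9.0)
print(f"{example:.1%}")  # 10.0%
```

A relative figure depends on the baseline it is measured against, so it should be read alongside the absolute WER values reported in the takeaways.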

Key Takeaways
  • β†’Whisper-LLaDA achieved 2.25%/4.94% WER on LibriSpeech test sets, showing 12.3% relative improvement over Whisper-LLaMA baseline.
  • β†’Audio-conditioned embeddings are essential for performance gains, as plain-text LLaDA without acoustic features failed to improve accuracy.
  • β†’The diffusion-based model offers faster inference than baseline systems in most experimental configurations.
  • β†’Random masking, low-confidence masking, and semi-autoregressive strategies were explored for deliberation-based processing.
  • β†’Code and model are open-sourced, enabling further research and development in diffusion-based ASR systems.