Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing
arXiv CS AI | Mengqi Wang, Zhan Liu, Zengrui Jin, Guangzhi Sun, Chao Zhang, Philip C. Woodland
AI Summary
Researchers developed Whisper-LLaDA, a diffusion-based large language model for automatic speech recognition (ASR) that achieves a 12.3% relative improvement over a Whisper-LLaMA baseline. The study shows that audio-conditioned embeddings are crucial for the accuracy gains: a plain-text LLaDA without acoustic features fails to improve performance.
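To make the "relative improvement" figure concrete, here is a minimal Python sketch of how word error rate (WER) and a relative WER reduction are commonly computed. It is an illustration, not the authors' evaluation code: the function names and the example numbers are hypothetical, and the summary does not state the baseline WER behind the 12.3% figure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, chosen only to show the arithmetic.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
gain = relative_improvement(baseline_wer=10.0, new_wer=8.0)
print(f"WER = {wer:.3f}, relative improvement = {gain:.1%}")
```

A relative improvement is a ratio of error rates, so a 12.3% relative gain means the new WER is about 0.877 times the baseline WER, not 12.3 absolute points lower.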
Key Takeaways
- Whisper-LLaDA achieved 2.25%/4.94% WER on LibriSpeech test sets, showing 12.3% relative improvement over Whisper-LLaMA baseline.
- Audio-conditioned embeddings are essential for performance gains, as plain-text LLaDA without acoustic features failed to improve accuracy.
- The diffusion-based model offers faster inference than baseline systems in most experimental configurations.
- Random masking, low-confidence masking, and semi-autoregressive strategies were explored for deliberation-based processing.
- Code and model are open-sourced, enabling further research and development in diffusion-based ASR systems.
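The difference between random and low-confidence masking can be sketched in a few lines of Python. This is an illustrative assumption about how positions in a first-pass transcript might be selected for re-prediction, not the paper's implementation: the helper names, the `[MASK]` token, and the confidence values are all hypothetical, and the denoising model and the semi-autoregressive strategy are out of scope here.

```python
import random
from typing import List, Optional


def low_confidence_positions(confidences: List[float], k: int) -> List[int]:
    """Indices of the k tokens with the lowest confidence scores."""
    k = max(0, min(k, len(confidences)))
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return sorted(order[:k])


def random_positions(length: int, k: int, seed: Optional[int] = None) -> List[int]:
    """Indices of k tokens chosen uniformly at random without replacement."""
    k = max(0, min(k, length))
    rng = random.Random(seed)
    return sorted(rng.sample(range(length), k))


def apply_mask(tokens: List[str], positions: List[int],
               mask_token: str = "[MASK]") -> List[str]:
    """Replace the selected positions with a mask token (hypothetical symbol)."""
    selected = set(positions)
    return [mask_token if i in selected else tok for i, tok in enumerate(tokens)]


# Hypothetical first-pass transcript with made-up per-token confidences.
tokens = ["the", "cat", "sit", "on", "mat"]
confidences = [0.90, 0.80, 0.30, 0.95, 0.40]

print(apply_mask(tokens, low_confidence_positions(confidences, k=2)))
print(apply_mask(tokens, random_positions(len(tokens), k=2, seed=0)))
```

In a deliberation setup of this kind, the masked positions would then be re-predicted by the diffusion model conditioned on the audio, so low-confidence selection concentrates the second pass on the tokens most likely to be wrong.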
#diffusion-models #large-language-models #automatic-speech-recognition #asr #whisper #llama #audio-processing #machine-learning #open-source