AINeutralarXiv – CS AI · 3h ago6/10
🧠
Diffusion Large Language Models for Visual Speech Recognition
Researchers introduce DLLM-VSR, a diffusion-based large language model framework for visual speech recognition that replaces traditional left-to-right decoding with iterative masked denoising. The system achieves state-of-the-art 19.5% word error rate on LRS3 by using confidence-based unmasking and length-guided candidate decoding to resolve visual ambiguities.