AI · Neutral · arXiv – CS AI · 6h ago · 6/10
Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation
Researchers introduce ELF-S2T, a continuous-target generative model for speech-to-text tasks that combines audio conditioning with diffusion-based language modeling. The approach achieves competitive performance on automatic speech recognition (ASR) and speech translation, and it reveals that both tasks share common error patterns rooted in continuous latent-space representations.