#speech-to-text News & Analysis

6 articles tagged with #speech-to-text. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

Researchers introduce ESRT, a privacy-preserving edge-cloud framework for multilingual speech-to-text translation that processes voice data locally while transmitting only compressed features to the cloud. The system achieves state-of-the-art performance across 45 languages while reducing bandwidth requirements by 10x and preventing voiceprint leakage.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Cross-Attention is Half Explanation in Speech-to-Text Models

Researchers find that cross-attention in speech-to-text models explains only about 50% of how the input drives the decoder's predictions, contradicting the widespread assumption that attention scores reliably indicate which parts of the audio are most relevant. The study, which spans multiple model scales, shows that attention gives an incomplete view of the factors behind predictions.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Speech Meets ELF: Audio Conditional Continuous-Target Diffusion for Speech Recognition and Translation

Researchers introduce ELF-S2T, a novel continuous-target generative model for speech-to-text tasks that combines audio conditioning with diffusion-based language modeling. The approach achieves competitive performance on ASR and speech translation while revealing that both tasks share common error patterns rooted in continuous latent space representations.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Diffusion Large Language Models for Visual Speech Recognition

Researchers introduce DLLM-VSR, a diffusion-based large language model framework for visual speech recognition that replaces traditional left-to-right decoding with iterative masked denoising. The system achieves state-of-the-art 19.5% word error rate on LRS3 by using confidence-based unmasking and length-guided candidate decoding to resolve visual ambiguities.

AI · Bullish · Google Research Blog · Jan 13 · 6/10 · 5

Next generation medical image interpretation with MedGemma 1.5 and medical speech to text with MedASR

Google has released MedGemma 1.5 for next-generation medical image interpretation and MedASR for medical speech-to-text applications. These new AI tools represent significant advancements in healthcare AI capabilities, focusing on specialized medical applications.

AI · Neutral · TechCrunch – AI · Apr 6 · 4/10

Google quietly releases an offline-first AI dictation app on iOS

Google has quietly launched a new offline-first AI dictation app for iOS that uses Gemma AI models. By offering offline functionality, the app appears positioned as a competitor to existing dictation tools such as Wispr Flow.