y0news

#speech-recognition News & Analysis

54 articles tagged with #speech-recognition. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 12

Hello-Chat: Towards Realistic Social Audio Interactions

Researchers have introduced Hello-Chat, an end-to-end audio language model designed to make AI conversations more realistic and emotionally resonant. The model addresses the robotic delivery of existing Large Audio Language Models by using real-life conversation data, and the authors report breakthrough performance in prosodic naturalness and emotional alignment.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 10

SHINE: Sequential Hierarchical Integration Network for EEG and MEG

Researchers developed SHINE, a Sequential Hierarchical Integration Network for analyzing brain signals (EEG/MEG) to detect speech from neural activity. The system achieved high F1-macro scores of 0.9155-0.9184 in the LibriBrain Competition 2025 by reconstructing speech-silence patterns from magnetoencephalography signals.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 7

Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing

Researchers developed a new AI framework using an RNN-T architecture to improve speech recognition for Taiwanese Hakka, an endangered low-resource language with high dialectal variability. The system achieved 57% and 40% relative error rate reductions for two different writing systems, marking the first systematic investigation into Hakka dialect variations in ASR.

AI · Neutral · Apple Machine Learning · Feb 25 · 6/10 · 3

Closing the Gap Between Text and Speech Understanding in LLMs

Research identifies a significant performance gap between speech-adapted Large Language Models and their text-based counterparts on language understanding tasks. Current approaches to bridge this gap rely on expensive large-scale speech synthesis methods, highlighting a key challenge in extending LLM capabilities to audio inputs.

AI · Neutral · Apple Machine Learning · Feb 24 · 6/10 · 2

AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

Researchers introduce AMUSE, a new benchmark for evaluating multimodal large language models in multi-speaker dialogue scenarios. The framework addresses current limitations of models like GPT-4o in tracking speakers, maintaining conversational roles, and reasoning across audio-visual streams in applications such as conversational video assistants.

AI · Bullish · Microsoft Research Blog · Feb 5 · 6/10 · 3

Paza: Introducing automatic speech recognition benchmarks and models for low resource languages

Microsoft Research launched Paza, a human-centered speech recognition pipeline, and PazaBench, the first benchmark leaderboard specifically designed for low-resource languages. The initiative covers 39 African languages with 52 models and has been tested with real communities to improve AI accessibility for underrepresented languages.

AI · Neutral · arXiv – CS AI · Mar 26 · 4/10

From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs

Researchers developed a new training framework to address contextual exposure bias in Speech-LLMs, where models trained on perfect conversation history perform poorly with error-prone real-world context. Their approach combines teacher error knowledge, context dropout, and direct preference optimization to improve robustness, reducing WER from 5.59% to 5.17% on TED-LIUM 3.
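To put the two WER figures in the summary above on the same footing as the relative reductions quoted elsewhere on this page, the short sketch below converts them into an absolute and a relative reduction. The helper name `relative_reduction` is hypothetical and the calculation is ordinary arithmetic on the two quoted numbers, not anything taken from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the error-rate reduction as a fraction of the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# The two WER values quoted in the summary (percent).
baseline_wer = 5.59
improved_wer = 5.17

absolute = baseline_wer - improved_wer          # percentage points
relative = relative_reduction(baseline_wer, improved_wer)

print(f"absolute: {absolute:.2f} points, relative: {relative:.1%}")
# prints "absolute: 0.42 points, relative: 7.5%"
```

Read this way, the improvement is about 0.42 percentage points, or roughly 7.5% relative to the baseline.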

AI · Bullish · arXiv – CS AI · Mar 17 · 5/10

Speech Recognition on TV Series with Video-guided Post-ASR Correction

Researchers have developed a Video-Guided Post-ASR Correction (VPC) framework that uses Video-Large Multimodal Models to improve speech recognition accuracy in complex environments like TV series. The system addresses challenges with multiple speakers, overlapping speech, and domain-specific terminology by leveraging video context to refine ASR outputs.

AI · Neutral · arXiv – CS AI · Mar 17 · 5/10

Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition

Researchers developed a novel Bayesian Low-rank Adaptation method for personalizing automatic speech recognition systems to better understand impaired speech. The approach addresses challenges in ASR systems like Whisper that struggle with non-normative speech patterns from conditions like cerebral palsy, using data-efficient fine-tuning on English and German datasets.

AI · Bullish · arXiv – CS AI · Mar 17 · 5/10

Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation

Researchers developed a reproducible pipeline to transform public Zoom recordings into speaker-attributed transcripts for training LLMs to simulate realistic civic deliberations. The method achieved a 67% reduction in perplexity and nearly doubled performance metrics, with human evaluations showing simulations often indistinguishable from real government meetings.

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 3

On the Parameter Estimation of Sinusoidal Models for Speech and Audio Signals

The paper compares three sinusoidal models for speech and audio signal processing: the standard Sinusoidal Model (SM), the Exponentially Damped Sinusoidal Model (EDSM), and the extended adaptive Quasi-Harmonic Model (eaQHM). The study finds that eaQHM performs better for medium-to-large analysis windows while EDSM excels with smaller ones, suggesting future research should combine both approaches.
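As a rough illustration of how the first two models differ, the sketch below evaluates one frame sample under the textbook forms: SM sums constant-amplitude cosines, while EDSM multiplies each cosine by an exponential envelope. The function name `synthesize`, the dictionary keys, and the example values are all hypothetical, and eaQHM is omitted because its adaptive formulation does not reduce to a one-line formula. This is not the paper's estimation procedure, only the signal model being fitted.

```python
import math


def synthesize(t: float, partials: list[dict], damped: bool = False) -> float:
    """Evaluate a sum of sinusoids at time t (seconds).

    damped=False: SM,   x(t) = sum_k a_k * cos(2*pi*f_k*t + phi_k)
    damped=True:  EDSM, x(t) = sum_k a_k * exp(d_k*t) * cos(2*pi*f_k*t + phi_k)
    """
    total = 0.0
    for p in partials:
        envelope = math.exp(p["damping"] * t) if damped else 1.0
        angle = 2 * math.pi * p["freq"] * t + p["phase"]
        total += p["amp"] * envelope * math.cos(angle)
    return total


# One illustrative partial: 100 Hz, unit amplitude, decaying at 50 1/s.
partials = [{"amp": 1.0, "freq": 100.0, "phase": 0.0, "damping": -50.0}]

sm_value = synthesize(0.01, partials)                 # constant amplitude
edsm_value = synthesize(0.01, partials, damped=True)  # amplitude decays in time
```

After one full period (10 ms) the SM sample is back near its starting amplitude, whereas the EDSM sample has been scaled by exp(-0.5). The extra per-partial damping term lets EDSM follow amplitude changes inside a frame.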

AI · Bullish · arXiv – CS AI · Mar 3 · 5/10 · 5

Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

Researchers developed a multi-pass LLM post-processing system that improves French clinical speech transcription by alternating between speaker recognition and word recognition passes. The system achieved significant word error rate reductions on suicide prevention conversations and remained stable on neurosurgery consultations, at a computational cost the authors consider feasible for clinical deployment.

AI · Neutral · Hugging Face Blog · Nov 21 · 4/10 · 8

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

The article title indicates coverage of the Open ASR (Automatic Speech Recognition) Leaderboard, focusing on trends and insights from its new multilingual and long-form evaluation tracks. No article body was available, so specific details about ASR developments could not be summarized.

AI · Neutral · Hugging Face Blog · Jan 19 · 4/10 · 4

Fine-Tune W2V2-Bert for low-resource ASR with 🤗 Transformers

The article appears to cover fine-tuning W2V2-Bert (Wav2Vec2-BERT) for automatic speech recognition in low-resource languages using Hugging Face Transformers. No article body was available, so the technical implementation and methodology could not be summarized.

AI · Bullish · Hugging Face Blog · Dec 20 · 4/10 · 4

Speculative Decoding for 2x Faster Whisper Inference

The article title indicates that speculative decoding is used to make Whisper inference about 2x faster. No article body was available, so the specific implementation and its implications could not be summarized.

AI · Neutral · Hugging Face Blog · Jun 19 · 4/10 · 6

Fine-Tune MMS Adapter Models for low-resource ASR

The article discusses fine-tuning MMS (Massively Multilingual Speech) adapter models for automatic speech recognition (ASR) in low-resource language scenarios. This approach aims to improve speech recognition performance for languages with limited training data by leveraging pre-trained multilingual models and adapter techniques.

AI · Neutral · Hugging Face Blog · Nov 3 · 4/10 · 6

Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

The article appears to discuss fine-tuning Whisper, OpenAI's automatic speech recognition model, for multilingual applications using the Hugging Face Transformers library. No article body was available, so a detailed summary is not possible.

AI · Neutral · Hugging Face Blog · Jan 12 · 4/10 · 5

Boosting Wav2Vec2 with n-grams in 🤗 Transformers

The article appears to discuss improving Wav2Vec2, a speech recognition model, by incorporating n-gram language models within the Hugging Face Transformers library. Adding a language model at decoding time could improve the accuracy of speech-to-text applications.
