#speech-recognition News & Analysis

54 articles tagged with #speech-recognition. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 12

Hello-Chat: Towards Realistic Social Audio Interactions

Researchers have introduced Hello-Chat, an end-to-end audio language model designed to create more realistic and emotionally resonant AI conversations. The model addresses the robotic delivery of existing Large Audio Language Models by using real-life conversation data, and the authors report breakthrough performance in prosodic naturalness and emotional alignment.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 10

SHINE: Sequential Hierarchical Integration Network for EEG and MEG

Researchers developed SHINE, a Sequential Hierarchical Integration Network for analyzing brain signals (EEG/MEG) to detect speech from neural activity. The system achieved high F1-macro scores of 0.9155-0.9184 in the LibriBrain Competition 2025 by reconstructing speech-silence patterns from magnetoencephalography signals.
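For readers unfamiliar with the metric, F1-macro is the unweighted mean of the per-class F1 scores, so the speech and silence classes count equally even when one is much more frequent. The following is a minimal sketch with made-up labels (1 for speech, 0 for silence); the function name and the data are illustrative assumptions and are not taken from the SHINE paper or the competition's scoring code.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical frame-level labels: 1 = speech, 0 = silence.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
score = f1_macro(y_true, y_pred)
print(f"F1-macro: {score:.2f}")  # F1-macro: 0.75
```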

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 7

Efficient Dialect-Aware Modeling and Conditioning for Low-Resource Taiwanese Hakka Speech Processing

Researchers developed a new framework based on the RNN-T architecture to improve speech recognition for Taiwanese Hakka, an endangered low-resource language with high dialectal variability. The system achieved relative error rate reductions of 57% and 40% for two different writing systems, in what is described as the first systematic investigation into Hakka dialect variations in ASR.

AI · Neutral · Apple Machine Learning · Feb 25 · 6/10 · 3

Closing the Gap Between Text and Speech Understanding in LLMs

Research identifies a significant performance gap between speech-adapted Large Language Models and their text-based counterparts on language understanding tasks. Current approaches to bridge this gap rely on expensive large-scale speech synthesis methods, highlighting a key challenge in extending LLM capabilities to audio inputs.

AI · Neutral · Apple Machine Learning · Feb 24 · 6/10 · 2

AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding

Researchers introduce AMUSE, a new benchmark for evaluating multimodal large language models in multi-speaker dialogue scenarios. The framework addresses current limitations of models like GPT-4o in tracking speakers, maintaining conversational roles, and reasoning across audio-visual streams in applications such as conversational video assistants.

AI · Bullish · Microsoft Research Blog · Feb 5 · 6/10 · 3

Paza: Introducing automatic speech recognition benchmarks and models for low resource languages

Microsoft Research launched Paza, a human-centered speech recognition pipeline, and PazaBench, the first benchmark leaderboard specifically designed for low-resource languages. The initiative covers 39 African languages with 52 models and has been tested with real communities to improve AI accessibility for underrepresented languages.

AI · Neutral · arXiv – CS AI · Mar 26 · 4/10

From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs

Researchers developed a new training framework to address contextual exposure bias in Speech-LLMs, where models trained on perfect conversation history perform poorly with error-prone real-world context. Their approach combines teacher error knowledge, context dropout, and direct preference optimization to improve robustness, reducing word error rate (WER) from 5.59% to 5.17% on TED-LIUM 3.
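To put the reported figures in perspective, a WER change can be read either as an absolute difference in percentage points or as a relative reduction against the baseline. The short sketch below works through the arithmetic for the two numbers quoted in the summary; the helper name is an illustrative assumption.

```python
def relative_reduction(baseline, improved):
    """Fraction of the baseline error that was removed."""
    return (baseline - improved) / baseline


# Figures quoted in the summary above (WER in percent on TED-LIUM 3).
baseline_wer, improved_wer = 5.59, 5.17

absolute = baseline_wer - improved_wer
relative = relative_reduction(baseline_wer, improved_wer)
print(f"absolute: {absolute:.2f} points, relative: {relative:.1%}")
# absolute: 0.42 points, relative: 7.5%
```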

AI · Bullish · arXiv – CS AI · Mar 17 · 5/10

Speech Recognition on TV Series with Video-guided Post-ASR Correction

Researchers have developed a Video-Guided Post-ASR Correction (VPC) framework that uses Video-Large Multimodal Models to improve speech recognition accuracy in complex environments like TV series. The system addresses challenges with multiple speakers, overlapping speech, and domain-specific terminology by leveraging video context to refine ASR outputs.

AI · Neutral · arXiv – CS AI · Mar 17 · 5/10

Variational Low-Rank Adaptation for Personalized Impaired Speech Recognition

Researchers developed a novel Bayesian Low-rank Adaptation method for personalizing automatic speech recognition systems to better understand impaired speech. The approach addresses challenges in ASR systems like Whisper that struggle with non-normative speech patterns from conditions like cerebral palsy, using data-efficient fine-tuning on English and German datasets.

AI · Bullish · arXiv – CS AI · Mar 17 · 5/10

Point of Order: Action-Aware LLM Persona Modeling for Realistic Civic Simulation

Researchers developed a reproducible pipeline to transform public Zoom recordings into speaker-attributed transcripts for training LLMs to simulate realistic civic deliberations. The method achieved a 67% reduction in perplexity and nearly doubled performance metrics, with human evaluations showing that simulations were often indistinguishable from real government meetings.

AI · Neutral · arXiv – CS AI · Mar 5 · 4/10

ACES: Accent Subspaces for Coupling, Explanations, and Stress-Testing in Automatic Speech Recognition

Researchers introduce ACES, a new method to analyze how automatic speech recognition systems perform differently across accents. The study finds that accent information is concentrated in early neural network layers and is deeply intertwined with speech recognition capabilities, making simple bias removal ineffective.

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 3

On the Parameter Estimation of Sinusoidal Models for Speech and Audio Signals

The paper compares three sinusoidal models for speech and audio signal processing: the standard Sinusoidal Model (SM), the Exponentially Damped Sinusoidal Model (EDSM), and the extended adaptive Quasi-Harmonic Model (eaQHM). The study finds that eaQHM performs better with medium-to-large analysis windows while EDSM excels with smaller ones, suggesting that future research should combine both approaches.

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 3

Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics

Researchers introduce Whisper-RIR-Mega, a new benchmark dataset for testing automatic speech recognition robustness in reverberant acoustic environments. The study evaluates five Whisper models and finds that reverberation consistently degrades performance across all model sizes, with word error rates increasing by 0.12 to 1.07 percentage points.
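Word error rate, the metric behind these figures, is conventionally the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The sketch below is a minimal pure-Python version with made-up transcripts; it assumes whitespace tokenization and no text normalization, so it is not the benchmark's own scoring code.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only one row of the table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical clean vs. reverberant transcripts of the same utterance.
reference = "the cat sat on the mat"
clean_wer = word_error_rate(reference, "the cat sat on the mat")
reverb_wer = word_error_rate(reference, "the cat sat on mat")
```

Comparing `clean_wer` and `reverb_wer` for paired recordings gives the kind of per-condition difference the benchmark reports in percentage points.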

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 4

MEBM-Phoneme: Multi-scale Enhanced BrainMagic for End-to-End MEG Phoneme Classification

Researchers developed MEBM-Phoneme, a neural decoder that uses magnetoencephalography (MEG) brain signals to classify phonemes with enhanced accuracy. The system integrates multi-scale convolutional modules and attention mechanisms to improve speech perception analysis from non-invasive brain recordings.

AI · Bullish · arXiv – CS AI · Mar 4 · 4/10 · 2

An Investigation Into Various Approaches For Bengali Long-Form Speech Transcription and Bengali Speaker Diarization

Researchers developed a multistage AI approach for Bengali speech transcription and speaker diarization, achieving significant improvements in processing long-form audio recordings. The system used fine-tuned Whisper models and custom segmentation techniques to address the low-resource nature of Bengali in speech technology applications.

AI · Bullish · arXiv – CS AI · Mar 3 · 5/10 · 5

Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

Researchers developed a multi-pass LLM post-processing system that improves French clinical speech transcription by alternating between speaker recognition and word recognition passes. The system achieved significant word error rate reductions in suicide prevention conversations and remained stable on neurosurgery consultations, at a computational cost feasible for clinical deployment.

AI · Neutral · arXiv – CS AI · Feb 27 · 4/10 · 2

A Holistic Framework for Robust Bangla ASR and Speaker Diarization with Optimized VAD and CTC Alignment

Researchers developed a robust framework for Bangla automatic speech recognition and speaker diarization that can handle long-form audio exceeding 30-60 seconds. The system uses Voice Activity Detection optimization and Connectionist Temporal Classification segmentation to maintain accuracy over extended durations in multi-speaker environments.

AI · Neutral · Hugging Face Blog · Nov 21 · 4/10 · 8

Open ASR Leaderboard: Trends and Insights with New Multilingual & Long-Form Tracks

The article title suggests coverage of the Open ASR (Automatic Speech Recognition) Leaderboard, focusing on trends and insights with new multilingual and long-form evaluation tracks. However, the article body appears to be empty or not provided, limiting the ability to extract specific details about ASR developments.

AI · Bullish · Hugging Face Blog · May 1 · 5/10 · 6

Powerful ASR + diarization + speculative decoding with Hugging Face Inference Endpoints

The article appears to discuss advanced AI speech processing technologies including Automatic Speech Recognition (ASR), speaker diarization, and speculative decoding capabilities available through Hugging Face Inference Endpoints. However, the article body content is not provided for detailed analysis.

AI · Neutral · Hugging Face Blog · Jan 19 · 4/10 · 4

Fine-Tune W2V2-Bert for low-resource ASR with 🤗 Transformers

The article appears to be about fine-tuning W2V2-Bert (Wav2Vec2-BERT) for automatic speech recognition in low-resource languages using Hugging Face Transformers. However, the article body is empty, preventing detailed analysis of the technical implementation or methodology.

AI · Bullish · Hugging Face Blog · Dec 20 · 4/10 · 4

Speculative Decoding for 2x Faster Whisper Inference

The article title suggests a technical advancement in Whisper inference using speculative decoding to achieve 2x faster processing speeds. However, no article body content was provided to analyze the specific implementation or implications.
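Since the article body is unavailable, the following is only a generic illustration of the idea named in the title, not the article's implementation. In greedy speculative decoding, a small draft model proposes several tokens, the large target model checks them, and the longest agreeing prefix is kept, so the output matches what the target model would have produced alone. The toy "models" here are hypothetical functions mapping a token list to the next token; a real system would verify all proposals in one batched forward pass, which is where the speedup comes from.

```python
def greedy_decode(target_next, prompt, max_new_tokens):
    """Baseline: ask the target model for every token, one at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target_next(tokens))
    return tokens


def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding: draft proposes k tokens, target verifies."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # 1) The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))

        # 2) The target checks each proposal. Appending the target's own token
        #    either accepts a matching proposal or corrects the first mismatch.
        for proposed in draft:
            expected = target_next(tokens)
            tokens.append(expected)
            if expected != proposed or len(tokens) >= limit:
                break
        else:
            # 3) Every proposal was accepted: the target adds one extra token.
            tokens.append(target_next(tokens))
    return tokens


# Hypothetical toy models over digit tokens: the target counts upward mod 10,
# and the draft agrees except after a 4 or a 9, where it wrongly predicts 0.
def target_next(tokens):
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    return 0 if tokens[-1] % 5 == 4 else (tokens[-1] + 1) % 10


fast = speculative_decode(target_next, draft_next, [0], max_new_tokens=12)
slow = greedy_decode(target_next, [0], max_new_tokens=12)
```

In this sketch `fast` and `slow` are identical sequences; the draft's mistakes only reduce how many tokens are accepted per round.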

AI · Neutral · Hugging Face Blog · Jun 19 · 4/10 · 6

Fine-Tune MMS Adapter Models for low-resource ASR

The article discusses fine-tuning MMS (Massively Multilingual Speech) adapter models for automatic speech recognition (ASR) in low-resource language scenarios. This approach aims to improve speech recognition performance for languages with limited training data by leveraging pre-trained multilingual models and adapter techniques.

AI · Neutral · Hugging Face Blog · Nov 3 · 4/10 · 6

Fine-Tune Whisper For Multilingual ASR with 🤗 Transformers

The article appears to discuss fine-tuning Whisper, OpenAI's automatic speech recognition model, for multilingual applications using the Hugging Face Transformers library. However, the article body is empty, making detailed analysis impossible.

AI · Neutral · Hugging Face Blog · Feb 1 · 4/10 · 7

Making automatic speech recognition work on large files with Wav2Vec2 in 🤗 Transformers

The article appears to discuss implementing automatic speech recognition for large audio files using the Wav2Vec2 model in the Hugging Face Transformers library. However, the article body is empty, preventing detailed analysis of the technical implementation or implications.
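One common way to run ASR on audio too long to process in a single pass is to split it into overlapping chunks, transcribe each, and discard the overlapping margins when stitching results together. The sketch below computes such chunk boundaries in samples. It is a generic illustration under that assumption, with hypothetical names and parameters, and may not match what the article describes.

```python
def chunk_bounds(num_samples, chunk_len, stride):
    """Return (start, end) sample ranges that overlap by 2 * stride.

    Each chunk keeps `stride` samples of extra context on both sides, so
    consecutive chunks advance by chunk_len - 2 * stride samples.
    """
    if chunk_len <= 2 * stride:
        raise ValueError("chunk_len must be larger than 2 * stride")
    if num_samples <= 0:
        return []

    step = chunk_len - 2 * stride
    bounds = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break
        start += step
    return bounds


# Hypothetical example: 100 samples, 40-sample chunks, 5 samples of context.
bounds = chunk_bounds(num_samples=100, chunk_len=40, stride=5)
print(bounds)  # [(0, 40), (30, 70), (60, 100)]
```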

AI · Neutral · Hugging Face Blog · Jan 12 · 4/10 · 5

Boosting Wav2Vec2 with n-grams in 🤗 Transformers

The article appears to discuss technical improvements to Wav2Vec2, a speech recognition model, by incorporating n-gram language models within the Hugging Face Transformers library. This represents an advancement in AI speech processing technology that could enhance accuracy and performance of speech-to-text applications.
