AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 12
🧠Researchers have introduced Hello-Chat, an end-to-end audio language model designed to create more realistic and emotionally resonant AI conversations. The model addresses the robotic nature of existing Large Audio Language Models by using real-life conversation data and achieving breakthrough performance in prosodic naturalness and emotional alignment.
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 10
🧠Researchers developed SHINE, a Sequential Hierarchical Integration Network for analyzing brain signals (EEG/MEG) to detect speech from neural activity. The system achieved high F1-macro scores of 0.9155-0.9184 in the LibriBrain Competition 2025 by reconstructing speech-silence patterns from magnetoencephalography signals.
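For readers unfamiliar with the metric, F1-macro is the unweighted mean of the per-class F1 scores, so speech and silence count equally regardless of how many frames each has. A minimal sketch in plain Python follows; the frame labels are made up for illustration and are not taken from the LibriBrain data.

```python
def f1_macro(y_true, y_pred, labels=(0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted frames contributes 0.0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical frame-level labels: 1 = speech, 0 = silence.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
score = f1_macro(y_true, y_pred)
print(f"F1-macro: {score:.4f}")
```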
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 7
🧠Researchers developed a new AI framework using RNN-T architecture to improve speech recognition for Taiwanese Hakka, an endangered low-resource language with high dialectal variability. The system achieved 57% and 40% relative error rate reductions for two different writing systems, marking the first systematic investigation into Hakka dialect variations in ASR.
AI · Neutral · Apple Machine Learning · Feb 25 · 6/10 · 3
🧠Research identifies a significant performance gap between speech-adapted Large Language Models and their text-based counterparts on language understanding tasks. Current approaches to bridge this gap rely on expensive large-scale speech synthesis methods, highlighting a key challenge in extending LLM capabilities to audio inputs.
AI · Neutral · Apple Machine Learning · Feb 24 · 6/10 · 2
🧠Researchers introduce AMUSE, a new benchmark for evaluating multimodal large language models in multi-speaker dialogue scenarios. The framework addresses current limitations of models like GPT-4o in tracking speakers, maintaining conversational roles, and reasoning across audio-visual streams in applications such as conversational video assistants.
AI · Bullish · Microsoft Research Blog · Feb 5 · 6/10 · 3
🧠Microsoft Research launched Paza, a human-centered speech recognition pipeline, and PazaBench, the first benchmark leaderboard specifically designed for low-resource languages. The initiative covers 39 African languages with 52 models and has been tested with real communities to improve AI accessibility for underrepresented languages.
AI · Neutral · arXiv – CS AI · Mar 26 · 4/10
🧠Researchers developed a new training framework to address contextual exposure bias in Speech-LLMs, where models trained on perfect conversation history perform poorly with error-prone real-world context. Their approach combines teacher error knowledge, context dropout, and direct preference optimization to improve robustness, reducing word error rate (WER) from 5.59% to 5.17% on TED-LIUM 3.
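To put the reported figures in context, WER is the word-level edit distance between reference and hypothesis divided by the reference length, and a drop from 5.59% to 5.17% is 0.42 percentage points absolute, or roughly 7.5% relative. The sketch below shows both calculations in plain Python; the example sentences and the function names are illustrative assumptions, not material from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error that was removed."""
    return (baseline - improved) / baseline


# Figures quoted in the summary above.
rel = relative_reduction(5.59, 5.17)
print(f"relative WER reduction: {rel:.1%}")

# Made-up sentences: one substitution and one deletion over six words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"example WER: {example_wer:.3f}")
```

The same `relative_reduction` arithmetic applies to any relative error rate figure quoted in this feed.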
AI · Bullish · arXiv – CS AI · Mar 17 · 5/10
🧠Researchers have developed a Video-Guided Post-ASR Correction (VPC) framework that uses Video-Large Multimodal Models to improve speech recognition accuracy in complex environments like TV series. The system addresses challenges with multiple speakers, overlapping speech, and domain-specific terminology by leveraging video context to refine ASR outputs.
AI · Neutral · arXiv – CS AI · Mar 17 · 5/10
🧠Researchers developed a novel Bayesian Low-rank Adaptation method for personalizing automatic speech recognition systems to better understand impaired speech. The approach addresses challenges in ASR systems like Whisper that struggle with non-normative speech patterns from conditions like cerebral palsy, using data-efficient fine-tuning on English and German datasets.
AI · Bullish · arXiv – CS AI · Mar 17 · 5/10
🧠Researchers developed a reproducible pipeline that turns public Zoom recordings into speaker-attributed transcripts for training LLMs to simulate realistic civic deliberations. The method achieved a 67% reduction in perplexity and nearly doubled the reported performance metrics, and in human evaluations the simulations were often indistinguishable from real government meetings.
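As a reminder of what a perplexity reduction measures: perplexity is the exponential of the average negative log-probability a model assigns to each token, so lower values mean the model finds the text less surprising. The sketch below uses invented per-token probabilities chosen to produce a reduction of about 67%; they are not the paper's numbers.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical models: the baseline assigns probability 0.05 to every
# token, the adapted model assigns 0.15.
baseline = perplexity([math.log(0.05)] * 4)
adapted = perplexity([math.log(0.15)] * 4)
reduction = 1 - adapted / baseline
print(f"baseline={baseline:.2f} adapted={adapted:.2f} reduction={reduction:.0%}")
```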
AI · Neutral · arXiv – CS AI · Mar 5 · 4/10
🧠Researchers introduce ACES, a new method to analyze how automatic speech recognition systems perform differently across accents. The study finds that accent information is concentrated in early neural network layers and is deeply intertwined with speech recognition capabilities, making simple bias removal ineffective.
AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 3
🧠A research paper compares three sinusoidal models for speech and audio signal processing: the standard Sinusoidal Model (SM), the Exponentially Damped Sinusoidal Model (EDSM), and the extended adaptive Quasi-Harmonic Model (eaQHM). The study finds that eaQHM performs better with medium-to-large analysis windows while EDSM excels with smaller ones, and suggests that future research combine both approaches.
AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 3
🧠Researchers introduce Whisper-RIR-Mega, a new benchmark dataset for testing automatic speech recognition robustness in reverberant acoustic environments. The study evaluates five Whisper models and finds that reverberation consistently degrades performance across all model sizes, with word error rates increasing by 0.12 to 1.07 percentage points.
AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 4
🧠Researchers developed MEBM-Phoneme, a neural decoder that uses magnetoencephalography (MEG) brain signals to classify phonemes with enhanced accuracy. The system integrates multi-scale convolutional modules and attention mechanisms to improve speech perception analysis from non-invasive brain recordings.
AI · Bullish · arXiv – CS AI · Mar 4 · 4/10 · 2
🧠Researchers developed a multistage AI approach for Bengali speech transcription and speaker diarization, achieving significant improvements in processing long-form audio recordings. The system used fine-tuned Whisper models and custom segmentation techniques to address the low-resource nature of Bengali in speech technology applications.
AI · Bullish · arXiv – CS AI · Mar 3 · 5/10 · 5
🧠Researchers developed a multi-pass LLM post-processing system that improves French clinical speech transcription accuracy by alternating between speaker recognition and word recognition passes. The system significantly reduced word error rate on suicide prevention conversations and remained stable on neurosurgery consultations, at a computational cost feasible for clinical deployment.
AI · Neutral · arXiv – CS AI · Feb 27 · 4/10 · 2
🧠Researchers developed a robust framework for Bangla automatic speech recognition and speaker diarization that can handle long-form audio beyond the 30 to 60 second range. The system uses Voice Activity Detection optimization and Connectionist Temporal Classification segmentation to maintain accuracy over extended durations in multi-speaker environments.
AI · Neutral · Hugging Face Blog · Nov 21 · 4/10 · 8
🧠The article title suggests coverage of the Open ASR (Automatic Speech Recognition) Leaderboard, focusing on trends and insights with new multilingual and long-form evaluation tracks. However, the article body appears to be empty or not provided, limiting the ability to extract specific details about ASR developments.
AI · Bullish · Hugging Face Blog · May 1 · 5/10 · 6
🧠The article appears to discuss advanced AI speech processing technologies including Automatic Speech Recognition (ASR), speaker diarization, and speculative decoding capabilities available through Hugging Face Inference Endpoints. However, the article body content is not provided for detailed analysis.
AI · Neutral · Hugging Face Blog · Jan 19 · 4/10 · 4
🧠The article appears to be about fine-tuning W2V2-Bert (Wav2Vec2-BERT) for automatic speech recognition in low-resource languages using Hugging Face Transformers. However, the article body is empty, preventing detailed analysis of the technical implementation or methodology.
AI · Bullish · Hugging Face Blog · Dec 20 · 4/10 · 4
🧠The article title suggests a technical advancement in Whisper inference using speculative decoding to achieve 2x faster processing speeds. However, no article body content was provided to analyze the specific implementation or implications.
AI · Neutral · Hugging Face Blog · Jun 19 · 4/10 · 6
🧠The article discusses fine-tuning MMS (Massively Multilingual Speech) adapter models for automatic speech recognition (ASR) in low-resource language scenarios. This approach aims to improve speech recognition performance for languages with limited training data by leveraging pre-trained multilingual models and adapter techniques.
AI · Neutral · Hugging Face Blog · Nov 3 · 4/10 · 6
🧠The article appears to discuss fine-tuning Whisper, OpenAI's automatic speech recognition model, for multilingual applications using Hugging Face Transformers library. However, the article body is empty, making detailed analysis impossible.
AI · Neutral · Hugging Face Blog · Feb 1 · 4/10 · 7
🧠The article appears to discuss implementing automatic speech recognition for processing large audio files using Wav2Vec2 model in Hugging Face Transformers library. However, the article body is empty, preventing detailed analysis of the technical implementation or implications.
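Since the article body is unavailable, the following is only a generic sketch of one common way to handle large audio files: split the signal into fixed-length chunks that overlap, transcribe each chunk, and merge the results. It is not the article's method. The function name `chunk_with_overlap` and its parameters are hypothetical, and the sketch covers only the boundary arithmetic, not the model call or the merge step.

```python
def chunk_with_overlap(num_samples, chunk_len, overlap):
    """Yield (start, end) sample ranges where consecutive chunks share
    `overlap` samples, so words cut at one boundary appear whole in a
    neighbouring chunk."""
    if chunk_len <= overlap:
        raise ValueError("chunk_len must be larger than overlap")
    step = chunk_len - overlap
    start = 0
    while start < num_samples:
        end = min(start + chunk_len, num_samples)
        yield start, end
        if end == num_samples:
            break  # the final chunk reached the end of the audio
        start += step


# Toy example: 10 samples, chunks of 4 with 1 sample of overlap.
ranges = list(chunk_with_overlap(10, chunk_len=4, overlap=1))
print(ranges)
```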
AI · Neutral · Hugging Face Blog · Jan 12 · 4/10 · 5
🧠The article appears to discuss technical improvements to Wav2Vec2, a speech recognition model, by incorporating n-gram language models within the Hugging Face Transformers library. This represents an advancement in AI speech processing technology that could enhance accuracy and performance of speech-to-text applications.