#speech-recognition News & Analysis

93 articles tagged with #speech-recognition. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 23 · 7/10

HALAS: A Human-Annotated Dataset of Hallucinations of Modern ASR Systems

Researchers introduce HALAS, the first human-annotated dataset documenting naturally occurring hallucinations from seven state-of-the-art ASR systems on real earnings call recordings. The benchmark reveals that hallucinations persist even in nearly correct transcriptions and establishes rigorous evaluation methods, with current detection techniques achieving only 53.1% F1 scores despite character-level metrics reaching 81% ROC-AUC.

AI · Bullish · Crypto Briefing · Jun 23 · 7/10

OpenAI prepares ChatGPT voice upgrade with Bidi 1 model

OpenAI is developing the GPT-Bidi-1 model designed to enhance ChatGPT's voice capabilities with improved real-time conversational fluidity and adaptability. This advancement represents a significant upgrade to AI voice interaction technology that could reshape how users engage with conversational AI systems.

🏢 OpenAI · 🧠 ChatGPT

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Whisfusion: Parallel ASR Decoding with Masked Diffusion

Whisfusion introduces a masked diffusion decoder that achieves faster speech-to-text processing than Whisper-large-v3 while matching or exceeding its accuracy across multilingual benchmarks. By replacing autoregressive decoding with parallel diffusion decoding, the system runs 4-5x faster while maintaining competitive performance with leading ASR systems, establishing non-autoregressive diffusion as a viable paradigm for high-throughput transcription.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

FormalASR: End-to-End Spoken Chinese to Formal Text

Researchers present FormalASR, compact end-to-end models that convert spoken Chinese directly into formal written text, eliminating the need for post-processing with large language models. Built on newly created datasets and fine-tuned versions of Qwen3-ASR, the solution achieves significant error reduction while enabling lightweight on-device deployment.

AI · Bearish · arXiv – CS AI · Jun 8 · 7/10

Hearing the Unspoken: Language Model Priors for Acoustic Adversarial Attacks

Researchers demonstrate a new adversarial attack called Semantic Gambit that exploits Large Language Models to significantly compromise real-time Automatic Speech Recognition systems. By leveraging predictive context from LLMs, the attack drives the Word Error Rate to 35.6%, three times higher than previously documented attacks, revealing a critical vulnerability in ASR pipelines that operate under temporal constraints.

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

Whisper Hallucination Detection and Mitigation via Hidden Representation Steering and Sparse AutoEncoders

Researchers demonstrate that hallucinations in Whisper, OpenAI's widely used speech recognition model, can be detected and mitigated using Sparse AutoEncoders and activation-space steering. These hallucinations are coherent transcriptions fabricated from non-speech audio. The approach reduces hallucination rates from 72-87% to 14-27% across model sizes with minimal performance degradation on actual speech.

AI · Bearish · arXiv – CS AI · Jun 5 · 7/10

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition

Researchers have developed a new adversarial attack method against automatic speech recognition systems that operates in feature space rather than directly on audio waveforms, achieving significantly higher transfer rates to black-box ASR models and bypassing existing defenses. The attack uses self-supervised learning representations and vocoders to reconstruct adversarial signals, revealing critical vulnerabilities in current ASR robustness evaluation protocols.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Audio Interaction Model

Researchers introduce Audio-Interaction, a unified streaming model that enables Large Audio Language Models to process audio in real time through a perceive-decide-respond loop, handling tasks from speech recognition to voice chatting. The framework, SoundFlow, includes a new 2.6M-item streaming corpus and demonstrates competitive performance on mainstream audio tasks while unlocking real-time interactive capabilities previously unavailable to offline models.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

MOSS-Audio Technical Report

MOSS-Audio is a unified audio-language model supporting speech, environmental sound, and music understanding with capabilities in captioning, question answering, and temporal grounding. The model introduces DeepStack cross-layer feature injection and time markers for explicit temporal cues, released in 4B and 8B variants for instruction-following and reasoning tasks.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

ASKD-Whisper: Adaptive Self-knowledge Distillation for Efficient and Low-Latency Automatic Speech Recognition

Researchers propose ASKD-Whisper, a new knowledge distillation technique that compresses OpenAI's Whisper speech recognition model while improving performance. The method achieves 5x faster inference and 1.07% lower error rates than the original teacher model by dynamically reducing reliance on the teacher's predictions during training.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

HoliTok is a new continuous speech tokenization model that unifies speech generation and understanding tasks by encoding 48kHz audio into compact 128-dimensional latent sequences at 25Hz. The breakthrough addresses a key challenge in building unified speech foundation models by creating a tokenization space that balances reconstruction fidelity, semantic preservation, and learnability without requiring architectural workarounds.

AI · Bearish · arXiv – CS AI · May 29 · 7/10

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Researchers have developed a comprehensive taxonomy of jailbreak attacks and defenses for Large Audio Language Models (LALMs), identifying vulnerabilities across semantic, acoustic, signal, and embedding layers. The study reveals that current defenses create tradeoffs between robustness and usability, highlighting the need for cost-aware safety evaluation beyond simple success-rate metrics.

AI · Bullish · Decrypt – AI · May 26 · 7/10

StepFun's Voice AI Topped Every Benchmark. It Also Hears Your Sighs

StepFun, a Shanghai-based AI lab known for developing efficient large language models, has achieved top benchmark results in voice AI technology with notable sensitivity to acoustic nuances like sighs. The breakthrough demonstrates the lab's capability to extend its LLM expertise into multimodal AI, potentially reshaping voice recognition and AI assistant markets.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

WorldSpeech: A Multilingual Speech Corpus from Around the World

Researchers introduce WorldSpeech, a multilingual speech corpus containing 65,000 hours of aligned audio-transcript data across 76 languages, addressing the critical gap in ASR training data for low-resource languages. Fine-tuning existing ASR models on this dataset achieves an average 63.5% relative Word-Error-Rate reduction, significantly improving speech recognition accuracy for underrepresented languages.

AI · Bullish · OpenAI News · May 7 · 7/10

Advancing voice intelligence with new models in the API

OpenAI has introduced new realtime voice models in its API that enable advanced capabilities including reasoning, translation, and speech transcription. These models represent a significant step toward more natural and intelligent voice-based interactions, expanding what developers can build into voice-enabled applications.

🏢 OpenAI

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

Ming-Flash-Omni: A Sparse, Unified Architecture for Multimodal Perception and Generation

Ming-Flash-Omni is a new 100 billion parameter multimodal AI model with Mixture-of-Experts architecture that uses only 6.1 billion active parameters per token. The model demonstrates unified capabilities across vision, speech, and language tasks, achieving performance comparable to Gemini 2.5 Pro on vision-language benchmarks.

🧠 Gemini

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

Berta: an open-source, modular tool for AI-enabled clinical documentation

Alberta Health Services deployed Berta, an open-source AI scribe platform that reduces clinical documentation costs by 70-95% compared to commercial alternatives. The system was used by 198 emergency physicians across 105 facilities, generating over 22,000 clinical sessions while keeping all data within secure health system infrastructure.

AI · Bearish · arXiv – CS AI · Mar 17 · 7/10

Sirens' Whisper: Inaudible Near-Ultrasonic Jailbreaks of Speech-Driven LLMs

Researchers developed SWhisper, a framework that uses near-ultrasonic audio to deliver covert jailbreak attacks against speech-driven AI systems. The technique is inaudible to humans but can successfully bypass AI safety measures with up to 94% effectiveness on commercial models.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

WAXAL: A Large-Scale Multilingual African Language Speech Corpus

Researchers have released WAXAL, a large-scale multilingual speech dataset covering 24 Sub-Saharan African languages representing over 100 million speakers. The dataset includes 1,250 hours of transcribed speech for ASR and 235 hours of high-quality recordings for TTS, released under CC-BY-4.0 license to advance inclusive AI technologies.

AI · Bullish · MIT News – AI · Dec 5 · 7/10 · 6

MIT researchers “speak objects into existence” using AI and robotics

MIT researchers have developed a speech-to-reality system that combines 3D generative AI with robotic assembly to create physical objects on demand from voice commands. The technology represents a significant advancement in AI-driven manufacturing and automation capabilities.

AI · Bullish · OpenAI News · Apr 24 · 7/10 · 6

Introducing ChatGPT and Whisper APIs

OpenAI has released APIs for ChatGPT and Whisper models, allowing developers to integrate these AI capabilities directly into their applications and products. This marks a significant step in making advanced conversational AI and speech recognition technology accessible to third-party developers.

AI · Bullish · OpenAI News · Sep 21 · 7/10 · 7

Introducing Whisper

OpenAI has trained and open-sourced Whisper, a neural network for speech recognition that achieves human-level robustness and accuracy on English speech. The model represents a significant advancement in AI speech recognition technology and is being made freely available to the community.

AI · Neutral · arXiv – CS AI · Jun 25 · 5/10

Phoneme-Level Mispronunciation Screening in Polish-Speaking Children with an Explainable Assistant

Researchers developed an AI-powered screening tool for detecting speech sound errors in Polish-speaking children, using wav2vec2 technology to identify sibilant substitutions. The system achieves 88.7% accuracy on a test set and demonstrates 72.9% precision with a 2.7% false-alarm rate, designed as a lightweight alternative to specialist evaluation for early intervention.

AI · Bullish · arXiv – CS AI · Jun 25 · 6/10

Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction

Researchers propose an error-aware TF-IDF retrieval-augmented generation framework that corrects automatic speech recognition (ASR) errors by using phonetically-aware lexical matching rather than heavy cross-modal embeddings. The method achieved a 37.2 percentage-point improvement in error-aware hit rate and reduced word error rate by 4.23 points on Persian speech data with minimal computational overhead.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

DSSCNet: A Transfer Learning Framework for Cross-Corpus Dysarthric Speech Severity Classification

Researchers introduce DSSCNet, a deep learning framework using transfer learning to improve dysarthric speech severity classification across different datasets. The model achieves 75.80% accuracy on TORGO and 68.25% on UA-Speech corpora, demonstrating significant improvements in speaker-independent assessment and cross-corpus generalization for assistive speech technologies.
