AI · Bearish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers have developed a comprehensive taxonomy of jailbreak attacks and defenses for Large Audio Language Models (LALMs), identifying vulnerabilities across semantic, acoustic, signal, and embedding layers. The study reveals that current defenses create tradeoffs between robustness and usability, highlighting the need for cost-aware safety evaluation beyond simple success-rate metrics.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠HoliTok is a new continuous speech tokenization model that unifies speech generation and understanding tasks by encoding 48 kHz audio into compact 128-dimensional latent sequences at 25 Hz. The breakthrough addresses a key challenge in building unified speech foundation models by creating a tokenization space that balances reconstruction fidelity, semantic preservation, and learnability without requiring architectural workarounds.
AI · Bullish · Decrypt – AI · 5d ago · 7/10
🧠StepFun, a Shanghai-based AI lab known for developing efficient large language models, has achieved top benchmark results in voice AI, showing notable sensitivity to acoustic nuances such as sighs. The result demonstrates the lab's ability to extend its LLM expertise into multimodal AI, potentially reshaping the voice recognition and AI assistant markets.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce WorldSpeech, a multilingual speech corpus containing 65,000 hours of aligned audio-transcript data across 76 languages, addressing a critical gap in ASR training data for low-resource languages. Fine-tuning existing ASR models on this dataset yields an average 63.5% relative reduction in word error rate (WER), significantly improving speech recognition accuracy for underrepresented languages.
AI · Bullish · OpenAI News · May 7 · 7/10
🧠OpenAI has introduced new realtime voice models in its API that support reasoning, translation, and speech transcription. These models represent a significant step toward more natural and intelligent voice-based interactions, broadening what developers can build into voice-enabled applications.
🏢 OpenAI
AI · Bullish · arXiv – CS AI · Mar 27 · 7/10
🧠Ming-Flash-Omni is a new 100-billion-parameter multimodal AI model with a Mixture-of-Experts architecture that activates only 6.1 billion parameters per token. The model demonstrates unified capabilities across vision, speech, and language tasks, achieving performance comparable to Gemini 2.5 Pro on vision-language benchmarks.
🧠 Gemini
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Alberta Health Services deployed Berta, an open-source AI scribe platform that reduces clinical documentation costs by 70-95% compared to commercial alternatives. The system was used by 198 emergency physicians across 105 facilities, generating over 22,000 clinical sessions while keeping all data within secure health system infrastructure.
AI · Bearish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers developed SWhisper, a framework that uses near-ultrasonic audio to deliver covert jailbreak attacks against speech-driven AI systems. The technique is inaudible to humans but can successfully bypass AI safety measures with up to 94% effectiveness on commercial models.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers have released WAXAL, a large-scale multilingual speech dataset covering 24 Sub-Saharan African languages representing over 100 million speakers. The dataset includes 1,250 hours of transcribed speech for ASR and 235 hours of high-quality recordings for TTS, released under CC-BY-4.0 license to advance inclusive AI technologies.
AI · Bullish · MIT News – AI · Dec 5 · 7/10
🧠MIT researchers have developed a speech-to-reality system that combines 3D generative AI with robotic assembly to create physical objects on demand from voice commands. The technology represents a significant advancement in AI-driven manufacturing and automation capabilities.
AI · Bullish · OpenAI News · Apr 24 · 7/10
🧠OpenAI has released APIs for ChatGPT and Whisper models, allowing developers to integrate these AI capabilities directly into their applications and products. This marks a significant step in making advanced conversational AI and speech recognition technology accessible to third-party developers.
AI · Bullish · OpenAI News · Sep 21 · 7/10
🧠OpenAI has trained and open-sourced Whisper, a neural network for speech recognition that approaches human-level robustness and accuracy on English speech. The model represents a significant advancement in AI speech recognition technology and is being made freely available to the community.
AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce Agentic ASR, a multi-turn interactive speech recognition framework that enables iterative refinement of recognized speech through semantic correction and reasoning-based editing. The approach addresses limitations of single-pass ASR systems by aligning with human communication patterns, and it introduces a new semantic evaluation metric (S²ER) that captures meaning-critical errors better than traditional token-level metrics.
AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce MetaSICL, a post-training method that enhances auditory large language models' ability to learn from in-context demonstrations without fine-tuning. The approach uses high-resource speech data to improve performance on low-resource tasks, outperforming traditional fine-tuning methods when labeled data is scarce or domain-mismatched.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers present a unified mathematical framework for Test-Time Adaptation (TTA) in autoregressive generative models, decomposing entropy minimization into token-level policy gradient and entropy losses. Validated on Whisper ASR across 20+ domains, the approach demonstrates consistent performance improvements and reconciles previously disparate adaptation methods under a single theoretical foundation.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers have developed Bangla-WhisperDiar, a fine-tuned speech recognition and speaker diarization system that achieves a 24.41% word error rate for ASR and a 23.92% diarization error rate. The work addresses critical gaps in Bangla language processing by combining OpenAI's Whisper model with PyAnnote's diarization framework, trained on custom datasets with extensive data augmentation.
AI · Bullish · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers propose Interactive ASR, a new framework that combines semantic-aware evaluation using LLM-as-a-Judge with multi-turn interactive correction to improve automatic speech recognition beyond traditional word error rate metrics. The approach simulates human-like interaction, enabling iterative refinement of recognition outputs across English, Chinese, and code-switching datasets.
AI · Bearish · arXiv – CS AI · Mar 27 · 6/10
🧠Researchers introduced WildASR, a multilingual diagnostic benchmark revealing that current ASR systems suffer severe performance degradation in real-world conditions despite achieving near-human accuracy on curated tests. The study found that ASR models often hallucinate plausible but unspoken content under degraded inputs, creating safety risks for voice agents.
AI · Bullish · MarkTechPost · Mar 26 · 6/10
🧠Cohere AI has released Cohere Transcribe, a new state-of-the-art Automatic Speech Recognition (ASR) model designed for enterprise applications. This marks the company's expansion beyond text generation and embedding models into the speech recognition market, targeting enterprise speech intelligence solutions.
🏢 Cohere
AI · Bullish · MarkTechPost · Mar 17 · 6/10
🧠Google AI has released WAXAL, an open multilingual speech dataset covering 24 African languages to improve Automatic Speech Recognition and Text-to-Speech systems. This addresses the significant data distribution problem where African languages remain poorly represented in speech technology training corpora.
🏢 Google
AI · Bullish · MarkTechPost · Mar 16 · 6/10
🧠IBM has released Granite 4.0 1B Speech, a compact multilingual speech-language model optimized for automatic speech recognition and translation. The model is specifically designed for enterprise and edge deployments where memory efficiency, low latency, and compute optimization are critical alongside performance quality.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers developed a protocol to evaluate speaker verification capabilities in speech-aware large language models, finding weak performance with error rates above 20%. They introduced ECAPA-LLM, a lightweight augmentation that achieves a 1.03% error rate by integrating speaker embeddings while maintaining a natural-language interface.
AI · Neutral · arXiv – CS AI · Mar 11 · 6/10
🧠Researchers introduce SCENEBench, a new benchmark for evaluating Large Audio Language Models (LALMs) beyond speech recognition, focusing on real-world audio understanding including background sounds, noise localization, and vocal characteristics. Testing of five state-of-the-art models revealed significant performance gaps: the models scored below random chance on some tasks while achieving high accuracy on others.
AI · Bullish · arXiv – CS AI · Mar 4 · 5/10
🧠Researchers developed GLoRIA, a parameter-efficient framework for automatic speech recognition that adapts to regional dialects using location metadata. The system achieves state-of-the-art performance while updating less than 10% of model parameters and demonstrates strong generalization to unseen dialects.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce Whisper-MLA, a modified version of OpenAI's Whisper speech recognition model that uses Multi-Head Latent Attention to reduce GPU memory consumption by up to 87.5% while maintaining accuracy. The innovation addresses a key scalability issue with transformer-based ASR models when processing long-form audio.