#speech-ai News & Analysis

8 articles tagged with #speech-ai. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 11 · 7/10

Towards Data-free and Training-free Compression for Speech Foundation Models Using Parameter Clustering

Researchers present a novel compression technique for speech foundation models using parameter clustering and k-means pruning without requiring training data or fine-tuning. The method demonstrates significant performance improvements over traditional magnitude-based pruning on HuBERT-large and Whisper-large-v3, with 27-59% relative WER reductions at various sparsity levels.
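As a rough illustration of the general idea behind clustering-based compression, the sketch below applies scalar k-means weight sharing to a toy list of weights, using no training data. It is a minimal sketch under stated assumptions and not the paper's procedure: the function names (`kmeans_1d`, `share_weights`, `relative_wer_reduction`), the evenly spaced initialisation, and the toy weights are all hypothetical. The last helper shows how a relative WER reduction such as the 27-59% quoted above is computed.

```python
import random


def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalar values; returns the k centroids."""
    if k < 2:
        raise ValueError("k must be at least 2")
    lo, hi = min(values), max(values)
    # Assumption: initialise centroids evenly across the value range.
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        sums = [0.0] * k
        counts = [0] * k
        for v in values:
            # Assign each value to its nearest centroid.
            j = min(range(k), key=lambda c: abs(v - centroids[c]))
            sums[j] += v
            counts[j] += 1
        # Move each centroid to the mean of its members; keep empty ones as-is.
        centroids = [
            sums[j] / counts[j] if counts[j] else centroids[j] for j in range(k)
        ]
    return centroids


def share_weights(values, k):
    """Replace every weight with its nearest centroid (weight sharing)."""
    centroids = kmeans_1d(values, k)
    return [min(centroids, key=lambda c: abs(v - c)) for v in values]


def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent, measured against the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Toy stand-in for one layer's weights (hypothetical data, not a real model).
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(200)]
shared = share_weights(weights, k=4)
```

After sharing, the layer holds at most `k` distinct values, so it can be stored as a small codebook plus per-weight indices.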

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

STEB: A Speech-to-Speech Translation Expressiveness Benchmark for Evaluating Beyond Translation Fidelity

Researchers introduced STEB, a new benchmark for evaluating speech-to-speech translation systems on both translation accuracy and emotional expressiveness preservation. Testing six systems revealed that while translation fidelity is strong, emotion and nonverbal vocalization preservation remain significant challenges, highlighting a critical gap in current AI capabilities.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

Researchers propose a code-mixing guided synthetic speech generation framework to improve automatic speech recognition (ASR) for multilingual code-switching scenarios. By optimizing synthetic data generation using the Code Mixing Index metric, the method demonstrates significant error rate reductions on Mandarin-English speech datasets, addressing a critical limitation in training data availability for code-switched ASR systems.
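For readers unfamiliar with the Code Mixing Index, the sketch below computes the commonly cited utterance-level form, CMI = 100 * (1 - max_lang_tokens / (n - u)), where n is the number of tokens and u the number of language-independent tokens. It assumes per-token language tags are already available; the tag names and the `code_mixing_index` function are hypothetical, and the paper may use a different variant.

```python
from collections import Counter


def code_mixing_index(lang_tags, neutral_tag="other"):
    """Utterance-level CMI in [0, 100); 0 means no code-mixing."""
    n = len(lang_tags)
    # Tokens with no language of their own (names, numbers, punctuation).
    u = sum(1 for tag in lang_tags if tag == neutral_tag)
    if n == u:
        return 0.0
    counts = Counter(tag for tag in lang_tags if tag != neutral_tag)
    dominant = max(counts.values())
    return 100.0 * (1.0 - dominant / (n - u))


# Hypothetical tags for a six-token Mandarin-English utterance.
mixed = ["zh", "zh", "en", "zh", "other", "en"]
monolingual = ["zh", "zh", "zh", "other"]
mixed_cmi = code_mixing_index(mixed)
mono_cmi = code_mixing_index(monolingual)
```

A monolingual utterance scores 0, and the score grows as the dominant language's share of tokens shrinks, which is what makes the metric usable as a target when generating synthetic code-switched data.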

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents

Researchers have released Afrispeech Semantics, a comprehensive benchmark evaluating how well audio language models perform semantic reasoning tasks beyond basic transcription. The study tests models across five key areas including entailment, consistency, plausibility, and accent variation, revealing significant gaps in current audio AI systems' ability to understand spoken language nuances.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

A Unified and Reproducible Experimentation Framework for Speech Understanding

Researchers introduce SURE, a unified experimentation framework that standardizes evaluation metrics and training pipelines for speech understanding models, addressing reproducibility challenges that have hindered fair comparison of speech foundation models and Speech LLMs across different deployment scenarios.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

Voxtral TTS

Voxtral TTS is a new multilingual text-to-speech AI model that can generate natural speech from just 3 seconds of reference audio. In human evaluations, it achieved a 68.4% win rate over ElevenLabs Flash v2.5 for voice cloning, demonstrating superior naturalness and expressivity.

AI · Neutral · arXiv – CS AI · Feb 27 · 5/10 · 7

Same Words, Different Judgments: Modality Effects on Preference Alignment

Researchers conducted a cross-modal study comparing human preference annotations between text and audio formats for AI alignment. The study found that while audio preferences are as reliable as text, different modalities lead to different judgment patterns, with synthetic ratings showing promise as replacements for human annotations.

AI · Neutral · arXiv – CS AI · Mar 2 · 4/10 · 6

Task-Lens: Cross-Task Utility Based Speech Dataset Profiling for Low-Resource Indian Languages

Researchers propose Task-Lens, a cross-task survey analyzing 50 Indian speech datasets across 26 languages for nine downstream speech tasks. The study reveals untapped metadata in existing datasets that could support multiple AI speech applications and identifies critical gaps in resources for underserved Indian languages.