#speech-processing News & Analysis

19 articles tagged with #speech-processing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

19 articles

AIBullishMarkTechPost · Mar 267/10

🧠

Tencent AI Open Sources Covo-Audio: A 7B Speech Language Model and Inference Pipeline for Real-Time Audio Conversations and Reasoning

Tencent AI Lab has open-sourced Covo-Audio, a 7B-parameter Large Audio Language Model that can process continuous audio inputs and generate audio outputs in real-time. The model unifies speech processing and language intelligence within a single end-to-end architecture designed for seamless cross-modal interaction.

AINeutralarXiv – CS AI · Mar 177/10

🧠

What Counts as Real? Speech Restoration and Voice Quality Conversion Pose New Challenges to Deepfake Detection

Researchers demonstrate that current audio deepfake detection systems incorrectly classify legitimate speech processing technologies like voice conversion and restoration as fake audio. A new multi-class detection approach shows improved accuracy by distinguishing between authentic speech, benign modifications, and actual spoofing attempts.

AINeutralarXiv – CS AI · 3d ago6/10

🧠

Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

This survey comprehensively reviews end-to-end neural architectures for multi-speaker automatic speech recognition on monaural audio, analyzing SIMO vs. SISO paradigms, recent algorithmic improvements, and extensions to long-form speech. The work addresses a critical gap in literature by systematizing recent advances in a field transitioning from cascade to unified E2E systems that better handle overlapping speech and speaker attribution.

AINeutralarXiv – CS AI · 4d ago6/10

🧠

On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation

Researchers challenge the widespread practice of using global token perplexity to evaluate generative spoken language models, arguing this metric fails to account for fundamental differences between speech and text modalities. The study proposes alternative likelihood- and generative-based evaluation methods that correlate more strongly with human perception, revealing that performance gaps between leading models and human baselines are smaller than previously believed.

🏢 Perplexity

AIBullisharXiv – CS AI · 5d ago6/10

🧠

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Researchers have released ParsVoice, a 2,200-hour Persian speech dataset with 1.36 million aligned segments from 1,815 speakers, making it 25 times larger than previous Persian TTS resources. The dataset was constructed using an automated pipeline combining ASR, fine-tuned language models, and quality assessment, and validation shows the corpus enables multi-speaker text-to-speech systems competitive with existing solutions.

🏢 Hugging Face

AINeutralarXiv – CS AI · Apr 206/10

🧠

DASB -- Discrete Audio and Speech Benchmark

Researchers introduce DASB, a comprehensive benchmark framework for evaluating discrete audio tokens across speech, audio, and music domains. The study reveals that discrete representations lag behind continuous features and require significant tuning, with semantic tokens outperforming acoustic ones, establishing standardized evaluation protocols for multimodal AI systems.

AINeutralarXiv – CS AI · Apr 206/10

🧠

Reading Between the Lines: The One-Sided Conversation Problem

Researchers formalize the one-sided conversation problem (1SC), where only one participant's dialogue can be recorded—common in telemedicine, call centers, and smart glasses. The study evaluates methods to reconstruct missing speaker turns and generate summaries from incomplete transcripts, finding that smaller models require finetuning while larger models show promise with prompting techniques.

AINeutralarXiv – CS AI · Apr 136/10

🧠

Practical Bayesian Inference for Speech SNNs: Uncertainty and Loss-Landscape Smoothing

Researchers demonstrate that applying Bayesian inference to Spiking Neural Networks (SNNs) for speech processing smooths the irregular loss landscape caused by threshold-based spike generation. Testing on speech datasets shows improved performance metrics and more regular predictive landscapes compared to deterministic approaches.

AINeutralarXiv – CS AI · Mar 176/10

🧠

Evaluation of Audio Language Models for Fairness, Safety, and Security

Researchers introduce a structural taxonomy and unified evaluation framework for Audio Large Language Models (ALLMs) to assess fairness, safety, and security. The study reveals systematic differences in how ALLMs handle audio versus text inputs, with FSS behavior closely tied to acoustic information integration methods.

AIBullisharXiv – CS AI · Mar 176/10

🧠

Nudging Hidden States: Training-Free Model Steering for Chain-of-Thought Reasoning in Large Audio-Language Models

Researchers developed training-free model steering techniques to improve reasoning in large audio-language models (LALMs) through chain-of-thought prompting. The approach achieved up to 4.4% accuracy gains and demonstrated cross-modal transfer where text-derived steering vectors can effectively guide speech-based reasoning.

AIBullisharXiv – CS AI · Mar 116/10

🧠

Latent Speech-Text Transformer

Facebook Research introduces the Latent Speech-Text Transformer (LST), which aggregates speech tokens into higher-level patches to improve computational efficiency and cross-modal alignment. The model achieves up to +6.5% absolute gain on speech HellaSwag benchmarks while maintaining text performance and reducing inference costs for ASR and TTS tasks.

AINeutralarXiv – CS AI · Mar 36/105

🧠

A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

Researchers introduced Spoof-SUPERB, a new benchmark for evaluating self-supervised learning models' ability to detect audio deepfakes. The study tested 20 SSL models and found that large-scale discriminative models like XLS-R and WavLM Large consistently outperformed others, especially under acoustic degradations.

AIBullisharXiv – CS AI · Feb 275/103

🧠

Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Researchers developed Lipi-Ghor-882, an 882-hour Bengali speech dataset, and demonstrated that targeted fine-tuning with synthetic acoustic degradation significantly improves automatic speech recognition for long-form Bengali audio. Their dual pipeline achieved a 0.019 Real-Time Factor, establishing new benchmarks for low-resource speech processing.

AIBullishHugging Face Blog · Apr 96/105

🧠

Hugging Face and Cloudflare Partner to Make Real-Time Speech and Video Seamless with FastRTC

Hugging Face and Cloudflare have partnered to launch FastRTC, a solution designed to enable seamless real-time speech and video processing. This collaboration combines Hugging Face's AI models with Cloudflare's edge computing infrastructure to reduce latency in real-time communications.

AINeutralarXiv – CS AI · Apr 145/10

🧠

Real-Time Voicemail Detection in Telephony Audio Using Temporal Speech Activity Features

Researchers developed a lightweight machine learning system that detects voicemail greetings versus live human answers in real-time telephony audio with 96.1% accuracy using only temporal speech activity patterns. The system processes calls in 46ms on standard CPUs and has been validated across 77,000 production calls, achieving practical false positive and negative rates suitable for AI calling applications.

AINeutralarXiv – CS AI · Mar 124/10

🧠

AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition

Researchers propose AMB-DSGDN, a new AI system for multimodal emotion recognition that uses adaptive modality balancing and differential graph attention mechanisms. The system addresses limitations in existing approaches by filtering noise and preventing dominant modalities from overwhelming the fusion process in text, speech, and visual data.

AINeutralarXiv – CS AI · Mar 44/104

🧠

Differentiable Time-Varying IIR Filtering for Real-Time Speech Denoising

Researchers have developed TVF (Time-Varying Filtering), a lightweight 1 million parameter speech enhancement model that combines digital signal processing with deep learning for real-time speech denoising. The model uses a neural network to predict coefficients for a 35-band IIR filter cascade, offering interpretable processing while adapting dynamically to changing noise conditions.

AINeutralarXiv – CS AI · Mar 34/103

🧠

CodecFlow: Efficient Bandwidth Extension via Conditional Flow Matching in Neural Codec Latent Space

CodecFlow is a new neural codec-based framework for speech bandwidth extension that efficiently reconstructs high-quality audio in compact latent space. The system uses conditional flow matching and residual vector quantization to improve speech clarity by restoring high-frequency content from low-bandwidth audio.

AINeutralarXiv – CS AI · Feb 273/107

🧠

Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody

This linguistic research study analyzes how Vietnamese learners of Mandarin Chinese acquire prosodic patterns, finding that advanced learners achieve native-like quantity in speech boundaries but develop inverted structural mapping patterns. The study reveals a trade-off between maintaining fluent output and achieving accurate prosodic structure in second language acquisition.