y0news

#speech-processing News & Analysis

14 articles tagged with #speech-processing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠

What Counts as Real? Speech Restoration and Voice Quality Conversion Pose New Challenges to Deepfake Detection

Researchers demonstrate that current audio deepfake detection systems incorrectly classify legitimate speech processing technologies like voice conversion and restoration as fake audio. A new multi-class detection approach shows improved accuracy by distinguishing between authentic speech, benign modifications, and actual spoofing attempts.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10
🧠

Practical Bayesian Inference for Speech SNNs: Uncertainty and Loss-Landscape Smoothing

Researchers demonstrate that applying Bayesian inference to Spiking Neural Networks (SNNs) for speech processing smooths the irregular loss landscape caused by threshold-based spike generation. Testing on speech datasets shows improved performance metrics and more regular predictive landscapes compared to deterministic approaches.
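The smoothing effect can be seen in a toy sketch (not the paper's method): a hard spike threshold makes a neuron's firing rate a step-like function of its weights, while Monte Carlo averaging over a Gaussian weight distribution — a crude stand-in for a Bayesian posterior — yields a smoother response. All names and parameters here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def spike_rate(weights, x, threshold=1.0, steps=20):
    """Run a toy integrate-and-fire neuron and return its firing rate.
    The hard threshold makes the rate a step-like function of the weights."""
    v, spikes = 0.0, 0
    for t in range(steps):
        v += float(weights @ x[t])   # integrate input current
        if v >= threshold:           # threshold-based spike generation
            spikes += 1
            v = 0.0                  # reset membrane potential after spiking
    return spikes / steps

def bayesian_spike_rate(mean_w, x, sigma=0.05, n_samples=64):
    """Average the firing rate over weights sampled from a Gaussian around
    mean_w -- a crude posterior approximation that smooths the step response."""
    samples = [spike_rate(mean_w + sigma * rng.standard_normal(mean_w.shape), x)
               for _ in range(n_samples)]
    return float(np.mean(samples))

x = rng.standard_normal((20, 4)) * 0.3
w = np.ones(4) * 0.2
print(spike_rate(w, x), bayesian_spike_rate(w, x))
```

The deterministic rate jumps discontinuously as `w` crosses a threshold; the averaged rate varies gradually, which is the loss-landscape smoothing the paper describes.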

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10
🧠

Evaluation of Audio Language Models for Fairness, Safety, and Security

Researchers introduce a structural taxonomy and unified evaluation framework for Audio Large Language Models (ALLMs) to assess fairness, safety, and security (FSS). The study reveals systematic differences in how ALLMs handle audio versus text inputs, with FSS behavior closely tied to acoustic information integration methods.

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠

Latent Speech-Text Transformer

Facebook Research introduces the Latent Speech-Text Transformer (LST), which aggregates speech tokens into higher-level patches to improve computational efficiency and cross-modal alignment. The model achieves up to +6.5% absolute gain on speech HellaSwag benchmarks while maintaining text performance and reducing inference costs for ASR and TTS tasks.
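The patching idea can be sketched minimally: pooling consecutive speech-token embeddings into fixed-size patches shortens the sequence the transformer attends over. This mean-pooling sketch is an assumption for illustration, not LST's actual aggregation module.

```python
import numpy as np

def aggregate_patches(token_embeddings, patch_size=4):
    """Mean-pool consecutive speech-token embeddings into higher-level
    patches, cutting sequence length (and attention cost) by patch_size."""
    n, d = token_embeddings.shape
    pad = (-n) % patch_size                       # right-pad to a multiple of patch_size
    padded = np.pad(token_embeddings, ((0, pad), (0, 0)))
    return padded.reshape(-1, patch_size, d).mean(axis=1)

tokens = np.random.default_rng(0).standard_normal((10, 8))  # 10 speech tokens, dim 8
patches = aggregate_patches(tokens, patch_size=4)
print(patches.shape)  # (3, 8): 10 tokens -> 3 patches
```

Since self-attention cost grows quadratically with sequence length, a 4× shorter patch sequence is where the efficiency gain comes from.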

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 5
🧠

A SUPERB-Style Benchmark of Self-Supervised Speech Models for Audio Deepfake Detection

Researchers introduce Spoof-SUPERB, a new benchmark for evaluating self-supervised learning (SSL) models' ability to detect audio deepfakes. The study tested 20 SSL models and found that large-scale discriminative models like XLS-R and WavLM Large consistently outperformed others, especially under acoustic degradations.

AI · Bullish · arXiv – CS AI · Feb 27 · 5/10 · 3
🧠

Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Researchers developed Lipi-Ghor-882, an 882-hour Bengali speech dataset, and demonstrated that targeted fine-tuning with synthetic acoustic degradation significantly improves automatic speech recognition for long-form Bengali audio. Their dual pipeline achieved a 0.019 Real-Time Factor, establishing new benchmarks for low-resource speech processing.
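One simple form of synthetic acoustic degradation — additive noise at a fixed signal-to-noise ratio — can be sketched as below. This is a generic augmentation, assumed for illustration; the paper's "extreme augmentation" pipeline is more elaborate.

```python
import numpy as np

def degrade(audio, rng, snr_db=5.0):
    """Add Gaussian noise scaled to a target SNR (in dB) -- a simple
    stand-in for synthetic acoustic degradation during fine-tuning."""
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.standard_normal(audio.shape) * np.sqrt(noise_power)
    return audio + noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s of a 220 Hz tone
noisy = degrade(clean, rng, snr_db=5.0)
```

Training on such "hard to hear" audio is what pushes the recognizer to stay accurate on degraded long-form recordings.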

AI · Neutral · arXiv – CS AI · 3d ago · 5/10
🧠

Real-Time Voicemail Detection in Telephony Audio Using Temporal Speech Activity Features

Researchers developed a lightweight machine learning system that detects voicemail greetings versus live human answers in real-time telephony audio with 96.1% accuracy using only temporal speech activity patterns. The system processes calls in 46ms on standard CPUs and has been validated across 77,000 production calls, achieving practical false positive and negative rates suitable for AI calling applications.
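The kind of temporal speech activity features involved can be sketched from a frame-level voice-activity mask: voicemail greetings tend to open with one long continuous speech burst, while live answers are short ("Hello?") followed by a pause. The feature set below is an illustrative assumption, not the paper's exact feature list.

```python
import numpy as np

def temporal_features(vad, frame_ms=30):
    """Summarise a binary voice-activity mask (1 = speech per frame) into
    temporal features: overall speech ratio, longest continuous speech
    run in ms, and number of speech segments."""
    vad = np.asarray(vad, dtype=int)
    # Boundaries of consecutive speech runs, found via edges in the padded mask.
    edges = np.flatnonzero(np.diff(np.concatenate(([0], vad, [0]))))
    runs = edges.reshape(-1, 2)
    run_lens_ms = (runs[:, 1] - runs[:, 0]) * frame_ms
    return {"speech_ratio": float(vad.mean()),
            "longest_run_ms": int(run_lens_ms.max()) if len(run_lens_ms) else 0,
            "n_segments": len(run_lens_ms)}

greeting = [0] * 5 + [1] * 60 + [0] * 5   # long uninterrupted greeting
live = [0] * 5 + [1] * 10 + [0] * 55      # brief "Hello?" then silence
print(temporal_features(greeting), temporal_features(live))
```

Feeding such low-cost features to a small classifier is what keeps per-call latency in the tens of milliseconds on a CPU.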

AI · Neutral · arXiv – CS AI · Mar 12 · 4/10
🧠

AMB-DSGDN: Adaptive Modality-Balanced Dynamic Semantic Graph Differential Network for Multimodal Emotion Recognition

Researchers propose AMB-DSGDN, a new AI system for multimodal emotion recognition that uses adaptive modality balancing and differential graph attention mechanisms. The system addresses limitations in existing approaches by filtering noise and preventing dominant modalities from overwhelming the fusion process in text, speech, and visual data.

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 4
🧠

Differentiable Time-Varying IIR Filtering for Real-Time Speech Denoising

Researchers have developed TVF (Time-Varying Filtering), a lightweight 1-million-parameter speech enhancement model that combines digital signal processing with deep learning for real-time speech denoising. The model uses a neural network to predict coefficients for a 35-band IIR filter cascade, offering interpretable processing while adapting dynamically to changing noise conditions.
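The core mechanism — IIR filtering whose coefficients change over time — can be sketched with a single one-pole filter whose coefficient is updated per frame. In the paper a neural network predicts coefficients for a 35-band cascade; the `alphas` array below is a hypothetical stand-in for that predictor's output.

```python
import numpy as np

def time_varying_onepole(x, alphas, frame_len=64):
    """Apply a one-pole IIR low-pass y[n] = a*y[n-1] + (1-a)*x[n] whose
    coefficient `a` is switched once per frame (time-varying filtering)."""
    y = np.zeros_like(x, dtype=float)
    prev = 0.0
    for i, sample in enumerate(x):
        a = alphas[min(i // frame_len, len(alphas) - 1)]  # per-frame coefficient
        prev = a * prev + (1.0 - a) * sample
        y[i] = prev
    return y

x = np.ones(128)
y = time_varying_onepole(x, alphas=[0.5])  # heavier smoothing -> slower response
```

Because each band is a few scalar recurrences rather than a learned convolution, the filtering stage stays cheap and interpretable while the network only has to predict coefficients.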

AI · Neutral · arXiv – CS AI · Feb 27 · 3/10 · 7
🧠

Quantity Convergence, Quality Divergence: Disentangling Fluency and Accuracy in L2 Mandarin Prosody

This linguistics study analyzes how Vietnamese learners of Mandarin Chinese acquire prosodic patterns, finding that advanced learners achieve native-like quantity in speech boundaries but develop inverted structural mapping patterns. The study reveals a trade-off between maintaining fluent output and achieving accurate prosodic structure in second language acquisition.