y0news

#audio-processing News & Analysis

19 articles tagged with #audio-processing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · 2d ago · 7/10

Demographic and Linguistic Bias Evaluation in Omnimodal Language Models

Researchers evaluated four omnimodal AI models across text, image, audio, and video processing, finding substantial demographic and linguistic biases particularly in audio understanding tasks. The study reveals significant accuracy disparities across age, gender, language, and skin tone, with audio tasks showing prediction collapse toward narrow categories, highlighting fairness concerns as these models see wider real-world deployment.

AI · Bearish · arXiv – CS AI · Mar 17 · 7/10

τ-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains

Researchers introduce τ-Voice, a new benchmark for evaluating full-duplex voice AI agents on complex real-world tasks. The study reveals significant performance gaps, with voice agents achieving only 30-45% of text-based AI capability under realistic conditions with noise and diverse accents.

AI · Neutral · arXiv – CS AI · Mar 11 · 7/10

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

Researchers introduce MUGEN, a comprehensive benchmark revealing significant weaknesses in large audio-language models when processing multiple concurrent audio inputs. The study shows performance degrades sharply with more audio inputs and proposes Audio-Permutational Self-Consistency as a training-free solution, achieving up to 6.74% accuracy improvements.
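The summary names "Audio-Permutational Self-Consistency" as a training-free fix, which suggests a permute-and-vote scheme: query the model once per ordering of the audio inputs and keep the majority answer. The sketch below is a guess at that shape under that assumption; `permutational_self_consistency`, `toy_model`, and the clip names are all hypothetical stand-ins, not the paper's implementation.

```python
from collections import Counter
from itertools import permutations

def permutational_self_consistency(model, audio_clips, question):
    """Query the model once per ordering of the audio inputs and
    majority-vote the answers. `model` is any callable mapping
    (ordered_clips, question) -> answer string."""
    answers = [model(list(order), question)
               for order in permutations(audio_clips)]
    return Counter(answers).most_common(1)[0][0]

# Toy stand-in model with an order bias: it parrots whichever clip
# comes first, unless "dog_bark.wav" happens to be heard last.
def toy_model(clips, question):
    return "dog" if clips[-1] == "dog_bark.wav" else clips[0].split("_")[0]

clips = ["dog_bark.wav", "car_horn.wav", "rain_loop.wav"]
print(permutational_self_consistency(toy_model, clips, "Which clip is an animal?"))  # -> dog
```

Voting over all orderings washes out the first-position bias the toy model exhibits, which is the intuition behind using permutation consistency to stabilize multi-audio answers.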

AI · Bullish · OpenAI News · May 13 · 7/10

Hello GPT-4o

OpenAI has announced GPT-4 Omni (GPT-4o), their new flagship AI model that can process and reason across audio, vision, and text simultaneously in real-time. This represents a significant advancement in multimodal AI capabilities, potentially setting a new standard for AI model functionality.

AI · Neutral · arXiv – CS AI · Mar 27 · 6/10

TAAC: A gate into Trustable Audio Affective Computing

Researchers have developed TAAC, a framework for trustable audio-based depression diagnosis that protects user identity information while maintaining diagnostic accuracy. The system uses adversarial loss-based subspace decomposition to separate depression features from sensitive identity data, enabling secure AI-powered mental health screening.

AI · Neutral · arXiv – CS AI · Mar 11 · 6/10

SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

Researchers introduce SCENEBench, a new benchmark for evaluating Large Audio Language Models (LALMs) beyond speech recognition, focusing on real-world audio understanding including background sounds, noise localization, and vocal characteristics. Testing of five state-of-the-art models revealed significant performance gaps, with some tasks performing below random chance while others achieved high accuracy.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization

Researchers introduce TripleSumm, a novel AI architecture that adaptively fuses visual, text, and audio modalities for improved video summarization. The team also releases MoSu, the first large-scale benchmark dataset providing all three modalities for multimodal video summarization research.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

Researchers developed Whisper-LLaDA, a diffusion-based large language model for automatic speech recognition that achieves 12.3% relative improvement over baseline models. The study demonstrates that audio-conditioned embeddings are crucial for accuracy improvements, while plain-text processing without acoustic features fails to enhance performance.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10

VoiceBridge: General Speech Restoration with One-step Latent Bridge Models

VoiceBridge is a new AI model that restores high-quality 48kHz speech from various types of audio distortion in a single step. The model uses a latent bridge approach with an energy-preserving variational autoencoder and transformer architecture to handle multiple speech restoration tasks simultaneously.

AI · Bullish · arXiv – CS AI · Feb 27 · 5/10

Learning to reconstruct from saturated data: audio declipping and high-dynamic range imaging

Researchers have developed a self-supervised learning method that can reconstruct audio and images from clipped/saturated measurements without requiring ground truth training data. The approach extends self-supervised learning to non-linear inverse problems and performs nearly as well as fully supervised methods while using only clipped measurements for training.
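Clipping is the non-linear forward operator this summary refers to: samples beyond a saturation threshold are flattened, and any plausible reconstruction must agree with the observation on unclipped samples while exceeding the threshold wherever the observation saturated. A minimal sketch of that operator and consistency check; the function names and tolerance are illustrative, not the paper's method.

```python
def clip(signal, threshold):
    """Hard-saturation operator: values beyond ±threshold are lost."""
    return [max(-threshold, min(threshold, s)) for s in signal]

def measurement_consistent(estimate, observed, threshold, tol=1e-9):
    """A reconstruction is consistent with clipped observations iff
    (a) it reproduces every unsaturated sample, and
    (b) it lies beyond the threshold wherever the observation saturated."""
    for x_hat, y in zip(estimate, observed):
        if abs(y) < threshold:                # unclipped sample: must match
            if abs(x_hat - y) > tol:
                return False
        elif y >= threshold and x_hat < threshold - tol:
            return False                      # clipped high: estimate too small
        elif y <= -threshold and x_hat > -threshold + tol:
            return False                      # clipped low: estimate too large
    return True

clean = [0.2, 0.9, -1.4, 0.4]
observed = clip(clean, 0.5)                   # [0.2, 0.5, -0.5, 0.4]
print(measurement_consistent(clean, observed, 0.5))  # True: the true signal fits
```

This constraint set is what makes declipping a non-linear inverse problem: many signals satisfy it, so a learned prior is needed to pick one, which the self-supervised approach learns from clipped measurements alone.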

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10

Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

Researchers developed an unbiased sliced Wasserstein RBF kernel with rotary positional embedding to improve audio captioning systems by addressing exposure bias and temporal relationship issues. The method shows significant improvements in caption quality and text-to-audio retrieval accuracy on AudioCaps and Clotho datasets, while also enhancing audio reasoning capabilities in large language models.

AI · Bullish · arXiv – CS AI · Mar 17 · 4/10

LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence

Researchers have developed LAMB, a new AI framework that improves automated audio captioning by better aligning audio features with large language models through Cauchy-Schwarz divergence optimization. The system achieved state-of-the-art performance on AudioCaps dataset by bridging the modality gap between audio and text embeddings.

AI · Bullish · arXiv – CS AI · Mar 5 · 4/10

LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

Researchers introduced LadderSym, a new Transformer-based AI method for detecting music practice errors that significantly outperforms existing approaches. The system uses multimodal processing of audio and symbolic music scores, more than doubling accuracy for detecting missed notes and improving extra note detection by 14.4 points.

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10

On the Parameter Estimation of Sinusoidal Models for Speech and Audio Signals

Researchers compare three sinusoidal models for speech and audio signal processing: the standard Sinusoidal Model (SM), the Exponentially Damped Sinusoidal Model (EDSM), and the extended adaptive Quasi-Harmonic Model (eaQHM). The study finds eaQHM performs better with medium-to-large analysis windows while EDSM excels with smaller ones, suggesting future research should combine the two approaches.
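The synthesis equations behind SM and EDSM can be sketched directly: SM sums constant-amplitude sinusoids over an analysis frame, while EDSM lets each partial decay (or grow) exponentially within the frame. The function names, frame length, and parameter values below are illustrative only, and the paper's actual subject, estimating these parameters from audio, is not shown.

```python
import math

def sm_frame(freqs, amps, phases, n_samples, sr):
    """Standard Sinusoidal Model: sum of constant-amplitude sinusoids."""
    return [sum(a * math.cos(2 * math.pi * f * n / sr + p)
                for f, a, p in zip(freqs, amps, phases))
            for n in range(n_samples)]

def edsm_frame(freqs, amps, phases, dampings, n_samples, sr):
    """Exponentially Damped Sinusoidal Model: each partial carries an
    exponential envelope exp(d * t), so it can decay within the window."""
    return [sum(a * math.exp(d * n / sr) * math.cos(2 * math.pi * f * n / sr + p)
                for f, a, p, d in zip(freqs, amps, phases, dampings))
            for n in range(n_samples)]

sr = 16000                                          # sample rate in Hz
sm = sm_frame([440.0], [1.0], [0.0], 160, sr)       # steady 440 Hz partial
ed = edsm_frame([440.0], [1.0], [0.0], [-50.0], 160, sr)  # decaying partial
```

The extra damping parameter is what lets EDSM track fast transients inside short windows, consistent with the finding that it excels when the analysis window is small.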

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10

Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics

Researchers introduce Whisper-RIR-Mega, a new benchmark dataset for testing automatic speech recognition robustness in reverberant acoustic environments. The study evaluates five Whisper models and finds that reverberation consistently degrades performance across all model sizes, with word error rates increasing by 0.12 to 1.07 percentage points.