#audio-processing News & Analysis

23 articles tagged with #audio-processing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · 2d ago · 7/10

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Researchers introduce COMET, a PLS-SVD framework that analyzes the modality gap in Contrastive Language-Audio Pretraining (CLAP) models by decomposing embeddings into interpretable concepts. The study reveals that only a small subset of shared conceptual axes drives similarity computation, and proposes a training-free spectral truncation method that improves zero-shot audio captioning performance while reducing dimensionality.

AI · Bullish · arXiv – CS AI · 4d ago · 7/10

StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

StreamSplit introduces a novel framework enabling continuous contrastive learning on edge devices by dynamically partitioning computation between local and cloud resources. Using reinforcement learning and uncertainty guidance, the system reduces latency by up to 4.7x and bandwidth by 77.1% while maintaining near-server accuracy, making distributed AI inference practical for resource-constrained hardware.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Demographic and Linguistic Bias Evaluation in Omnimodal Language Models

Researchers evaluated four omnimodal AI models across text, image, audio, and video processing, finding substantial demographic and linguistic biases particularly in audio understanding tasks. The study reveals significant accuracy disparities across age, gender, language, and skin tone, with audio tasks showing prediction collapse toward narrow categories, highlighting fairness concerns as these models see wider real-world deployment.

AI · Bearish · arXiv – CS AI · Mar 17 · 7/10

τ-Voice: Benchmarking Full-Duplex Voice Agents on Real-World Domains

Researchers introduce τ-voice, a new benchmark for evaluating full-duplex voice AI agents on complex real-world tasks. The study reveals significant performance gaps, with voice agents achieving only 30-45% of text-based AI capability under realistic conditions with noise and diverse accents.

🧠 GPT-5

AI · Neutral · arXiv – CS AI · Mar 11 · 7/10

MUGEN: Evaluating and Improving Multi-audio Understanding of Large Audio-Language Models

Researchers introduce MUGEN, a comprehensive benchmark revealing significant weaknesses in large audio-language models when processing multiple concurrent audio inputs. The study shows that performance degrades sharply as the number of audio inputs grows and proposes Audio-Permutational Self-Consistency as a training-free solution, improving accuracy by up to 6.74%.

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10 · 2

When Scaling Fails: Mitigating Audio Perception Decay of LALMs via Multi-Step Perception-Aware Reasoning

Researchers identified a critical problem in Large Audio-Language Models (LALMs): audio perception deteriorates during extended reasoning. They developed the MPAR² framework using reinforcement learning, which improved perception performance from 31.74% to 63.51% and achieved 74.59% accuracy on the MMAU benchmark.

AI · Bullish · OpenAI News · May 13 · 7/10 · 7

Hello GPT-4o

OpenAI has announced GPT-4 Omni (GPT-4o), its new flagship AI model that can process and reason across audio, vision, and text simultaneously in real time. This represents a significant advancement in multimodal AI capabilities, potentially setting a new standard for AI model functionality.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Remix the Timbre: Diffusion-Based Style Transfer Across Polyphonic Stems

Researchers introduce MixtureTT, a diffusion-based system for timbre transfer in polyphonic music that directly processes mixed audio rather than separating instruments first. The approach outperforms existing separate-then-transfer pipelines by modeling dependencies across multiple stems simultaneously, reducing inference costs and eliminating source separation artifacts.

AI · Neutral · arXiv – CS AI · May 11 · 5/10

Dependence on Early and Late Reverberation of Single-Channel Speaker Distance Estimation

Researchers decomposed room impulse responses to understand which acoustic components enable single-channel speaker distance estimation. Without time calibration, models rely on early reflections and achieve 1.29 m error, while time-calibrated models achieve 0.14 m accuracy using propagation delay alone.

AI · Neutral · arXiv – CS AI · Mar 27 · 6/10

TAAC: A gate into Trustable Audio Affective Computing

Researchers have developed TAAC, a framework for trustable audio-based depression diagnosis that protects user identity information while maintaining diagnostic accuracy. The system uses adversarial loss-based subspace decomposition to separate depression features from sensitive identity data, enabling secure AI-powered mental health screening.

AI · Neutral · arXiv – CS AI · Mar 11 · 6/10

SCENEBench: An Audio Understanding Benchmark Grounded in Assistive and Industrial Use Cases

Researchers introduce SCENEBench, a new benchmark for evaluating Large Audio Language Models (LALMs) beyond speech recognition, focusing on real-world audio understanding including background sounds, noise localization, and vocal characteristics. Testing of five state-of-the-art models revealed significant performance gaps: models scored below random chance on some tasks while achieving high accuracy on others.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 6

TripleSumm: Adaptive Triple-Modality Fusion for Video Summarization

Researchers introduce TripleSumm, a novel AI architecture that adaptively fuses visual, text, and audio modalities for improved video summarization. The team also releases MoSu, the first large-scale benchmark dataset providing all three modalities for multimodal video summarization research.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 10

TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation

Researchers have developed TIGER, a new speech separation model that reduces parameters by 94.3% and computational costs by 95.3% while outperforming current state-of-the-art models. The team also introduced EchoSet, a new dataset with realistic acoustic environments that shows better generalization for speech separation models.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 15

Audio-Conditioned Diffusion LLMs for ASR and Deliberation Processing

Researchers developed Whisper-LLaDA, a diffusion-based large language model for automatic speech recognition that achieves 12.3% relative improvement over baseline models. The study demonstrates that audio-conditioned embeddings are crucial for accuracy improvements, while plain-text processing without acoustic features fails to enhance performance.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 14

VoiceBridge: General Speech Restoration with One-step Latent Bridge Models

VoiceBridge is a new AI model that restores high-quality 48 kHz speech from various types of audio distortion in a single step. The model uses a latent bridge approach with an energy-preserving variational autoencoder and transformer architecture to handle multiple speech restoration tasks simultaneously.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 20

Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Researchers introduced Resp-Agent, an AI system that uses multimodal deep learning to generate respiratory sounds and diagnose diseases. The system addresses data scarcity and representation gaps in medical AI through an autonomous agent-based approach and includes a new benchmark dataset of 229k recordings.

AI · Bullish · arXiv – CS AI · Feb 27 · 5/10 · 7

Learning to reconstruct from saturated data: audio declipping and high-dynamic range imaging

Researchers have developed a self-supervised learning method that can reconstruct audio and images from clipped/saturated measurements without requiring ground truth training data. The approach extends self-supervised learning to non-linear inverse problems and performs nearly as well as fully supervised methods while using only clipped measurements for training.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 6

Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning

Researchers developed an unbiased sliced Wasserstein RBF kernel with rotary positional embedding to improve audio captioning systems by addressing exposure bias and temporal relationship issues. The method shows significant improvements in caption quality and text-to-audio retrieval accuracy on AudioCaps and Clotho datasets, while also enhancing audio reasoning capabilities in large language models.

AI · Bullish · arXiv – CS AI · Mar 17 · 4/10

LAMB: LLM-based Audio Captioning with Modality Gap Bridging via Cauchy-Schwarz Divergence

Researchers have developed LAMB, a new AI framework that improves automated audio captioning by better aligning audio features with large language models through Cauchy-Schwarz divergence optimization. The system achieved state-of-the-art performance on the AudioCaps dataset by bridging the modality gap between audio and text embeddings.

AI · Bullish · arXiv – CS AI · Mar 5 · 4/10

LadderSym: A Multimodal Interleaved Transformer for Music Practice Error Detection

Researchers introduced LadderSym, a new Transformer-based AI method for detecting music practice errors that significantly outperforms existing approaches. The system uses multimodal processing of audio and symbolic music scores, more than doubling accuracy for detecting missed notes and improving extra note detection by 14.4 points.

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 3

On the Parameter Estimation of Sinusoidal Models for Speech and Audio Signals

This research paper compares three sinusoidal models for speech and audio signal processing: the standard Sinusoidal Model (SM), the Exponentially Damped Sinusoidal Model (EDSM), and the extended adaptive Quasi-Harmonic Model (eaQHM). The study finds that eaQHM performs better for medium-to-large analysis windows while EDSM excels with smaller ones, suggesting future research should combine both approaches.

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10 · 3

Whisper-RIR-Mega: A Paired Clean-Reverberant Speech Benchmark for ASR Robustness to Room Acoustics

Researchers introduce Whisper-RIR-Mega, a new benchmark dataset for testing automatic speech recognition robustness in reverberant acoustic environments. The study evaluates five Whisper models and finds that reverberation consistently degrades performance across all model sizes, with word error rates increasing by 0.12 to 1.07 percentage points.

AI · Neutral · Hugging Face Blog · Feb 1 · 4/10 · 7

Making automatic speech recognition work on large files with Wav2Vec2 in 🤗 Transformers

The article appears to discuss implementing automatic speech recognition for large audio files using the Wav2Vec2 model in the Hugging Face Transformers library. However, the article body is empty, preventing detailed analysis of the technical implementation or its implications.