19 articles tagged with #audio-processing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bearish · arXiv › CS AI · 2d ago · 7/10
🎧 Researchers evaluated four omnimodal AI models across text, image, audio, and video processing, finding substantial demographic and linguistic biases, particularly in audio understanding tasks. The study reveals significant accuracy disparities across age, gender, language, and skin tone, with audio tasks showing prediction collapse toward narrow categories, highlighting fairness concerns as these models see wider real-world deployment.
AI · Bearish · arXiv › CS AI · Mar 17 · 7/10
🎧 Researchers introduce Ο-voice, a new benchmark for evaluating full-duplex voice AI agents on complex real-world tasks. The study reveals significant performance gaps, with voice agents achieving only 30-45% of text-based AI capability under realistic conditions with noise and diverse accents.
AI · Neutral · arXiv › CS AI · Mar 11 · 7/10
🎧 Researchers introduce MUGEN, a comprehensive benchmark revealing significant weaknesses in large audio-language models when processing multiple concurrent audio inputs. The study shows performance degrades sharply with more audio inputs and proposes Audio-Permutational Self-Consistency as a training-free solution, achieving up to 6.74% accuracy improvements.
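The method's name suggests a permute-and-vote scheme; this minimal sketch assumes Audio-Permutational Self-Consistency means querying the model under several orderings of the concurrent clips and majority-voting the answers (`model.answer` is a hypothetical stand-in for the LALM's inference call, not a real API):

```python
from collections import Counter
from itertools import permutations

def permutational_self_consistency(model, audios, question, max_perms=6):
    """Query the model once per ordering of the concurrent audio
    clips, then majority-vote the textual answers."""
    votes = []
    for perm in list(permutations(audios))[:max_perms]:
        # `model.answer` is a hypothetical stand-in for the LALM's inference call.
        votes.append(model.answer(list(perm), question))
    answer, _count = Counter(votes).most_common(1)[0]
    return answer
```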
AI · Bullish · arXiv › CS AI · Mar 4 · 6/10
🎧 Researchers identified a critical problem in Large Audio-Language Models (LALMs) where audio perception deteriorates during extended reasoning. They developed the MPAR² framework using reinforcement learning, which improved perception performance from 31.74% to 63.51% and achieved 74.59% accuracy on the MMAU benchmark.
AI · Bullish · OpenAI News · May 13 · 7/10
🎧 OpenAI has announced GPT-4 Omni (GPT-4o), its new flagship AI model that can process and reason across audio, vision, and text simultaneously in real time. This represents a significant advancement in multimodal AI capabilities, potentially setting a new standard for AI model functionality.
AI · Neutral · arXiv › CS AI · Mar 27 · 6/10
🎧 Researchers have developed TAAC, a framework for trustable audio-based depression diagnosis that protects user identity information while maintaining diagnostic accuracy. The system uses adversarial loss-based subspace decomposition to separate depression features from sensitive identity data, enabling secure AI-powered mental health screening.
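The summary names adversarial subspace decomposition without details; the sketch below shows the generic pattern it implies: split the embedding into two subspaces and train a speaker-identity head through gradient reversal so identity cues are driven out of the diagnostic subspace. Layer sizes and the two-head layout are illustrative assumptions, not TAAC's actual architecture:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Gradient reversal: identity on the forward pass, negated gradient backward."""
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad

class SubspaceModel(nn.Module):
    def __init__(self, dim=256, n_speakers=100):
        super().__init__()
        self.encoder = nn.Linear(128, dim)       # audio features -> embedding
        self.dep_head = nn.Linear(dim // 2, 2)   # depression classifier
        self.id_head = nn.Linear(dim // 2, n_speakers)  # adversarial speaker-ID head

    def forward(self, feats):
        z = self.encoder(feats)
        z_dep, z_id = z.chunk(2, dim=-1)         # split into two subspaces
        dep_logits = self.dep_head(z_dep)
        # Reversed gradients push identity information out of the depression subspace.
        id_logits = self.id_head(GradReverse.apply(z_dep))
        return dep_logits, id_logits
```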
AI · Neutral · arXiv › CS AI · Mar 11 · 6/10
🎧 Researchers introduce SCENEBench, a new benchmark for evaluating Large Audio Language Models (LALMs) beyond speech recognition, focusing on real-world audio understanding including background sounds, noise localization, and vocal characteristics. Testing of five state-of-the-art models revealed significant performance gaps, with some tasks performing below random chance while others achieved high accuracy.
AI · Bullish · arXiv › CS AI · Mar 3 · 6/10
🎧 Researchers introduce TripleSumm, a novel AI architecture that adaptively fuses visual, text, and audio modalities for improved video summarization. The team also releases MoSu, the first large-scale benchmark dataset providing all three modalities for multimodal video summarization research.
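The summary says the architecture "adaptively fuses" the three modalities without detailing how; one common realization is a learned per-time-step gate, sketched below. The shared feature dimension and gate design are assumptions, not TripleSumm's published layout:

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Minimal adaptive-fusion sketch: a learned gate weights each
    modality per time step before the features are summed."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(3 * dim, 3)

    def forward(self, vis, txt, aud):
        # vis/txt/aud: (batch, time, dim) features, already projected to a shared dim
        stacked = torch.stack([vis, txt, aud], dim=-2)            # (B, T, 3, D)
        weights = torch.softmax(
            self.gate(torch.cat([vis, txt, aud], dim=-1)), dim=-1
        )                                                          # (B, T, 3)
        return (weights.unsqueeze(-1) * stacked).sum(dim=-2)      # (B, T, D)
```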
AI · Bullish · arXiv › CS AI · Mar 2 · 7/10
🎧 Researchers have developed TIGER, a new speech separation model that reduces parameters by 94.3% and computational costs by 95.3% while outperforming current state-of-the-art models. The team also introduced EchoSet, a new dataset with realistic acoustic environments that shows better generalization for speech separation models.
AI · Bullish · arXiv › CS AI · Mar 2 · 6/10
🎧 Researchers developed Whisper-LLaDA, a diffusion-based large language model for automatic speech recognition that achieves a 12.3% relative improvement over baseline models. The study demonstrates that audio-conditioned embeddings are crucial for accuracy improvements, while plain-text processing without acoustic features fails to enhance performance.
AI · Bullish · arXiv › CS AI · Mar 2 · 7/10
🎧 VoiceBridge is a new AI model that restores high-quality 48kHz speech from various types of audio distortion in a single generation step. The model uses a latent bridge approach with an energy-preserving variational autoencoder and a transformer architecture to handle multiple speech restoration tasks simultaneously.
AI · Bullish · arXiv › CS AI · Mar 2 · 6/10
🎧 Researchers introduced Resp-Agent, an AI system that uses multimodal deep learning to generate respiratory sounds and diagnose diseases. The system addresses data scarcity and representation gaps in medical AI through an autonomous agent-based approach and includes a new benchmark dataset of 229k recordings.
AI · Bullish · arXiv › CS AI · Feb 27 · 5/10
🎧 Researchers have developed a self-supervised learning method that can reconstruct audio and images from clipped/saturated measurements without requiring ground truth training data. The approach extends self-supervised learning to non-linear inverse problems and performs nearly as well as fully supervised methods while using only clipped measurements for training.
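The summary does not spell out the objective; a common self-supervised formulation for declipping uses only the clipped observation itself, as in this hedged sketch (the threshold and equal weighting are illustrative, not the paper's exact loss):

```python
import torch

def declipping_consistency_loss(estimate, observed, threshold=0.8):
    """Self-supervised declipping sketch: match the observation on
    unclipped samples; push clipped samples past the clipping level."""
    saturated = observed.abs() >= threshold                   # boolean clipping mask
    keep = (~saturated).float()
    # Data fidelity only where the measurement is reliable (not clipped).
    fidelity = (keep * (estimate - observed) ** 2).sum() / keep.sum().clamp(min=1)
    # Hinge penalty: where clipping occurred, |estimate| should exceed the level.
    violation = (saturated.float() * torch.relu(threshold - estimate.abs())).sum() \
        / saturated.float().sum().clamp(min=1)
    return fidelity + violation
```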
AI · Bullish · arXiv › CS AI · Feb 27 · 6/10
🎧 Researchers developed an unbiased sliced Wasserstein RBF kernel with rotary positional embedding to improve audio captioning systems by addressing exposure bias and temporal relationship issues. The method shows significant improvements in caption quality and text-to-audio retrieval accuracy on the AudioCaps and Clotho datasets, while also enhancing audio reasoning capabilities in large language models.
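The paper's specific contribution (an unbiased sliced Wasserstein RBF kernel) builds on the plain sliced Wasserstein distance; below is a minimal sketch of that underlying quantity, assuming equally sized point sets, rather than the paper's estimator:

```python
import torch

def sliced_wasserstein(x, y, n_proj=64):
    """Generic sliced Wasserstein-2 sketch: project both point sets
    onto random unit directions and compare the sorted projections.
    Assumes x and y contain the same number of points."""
    d = x.shape[-1]
    theta = torch.randn(d, n_proj)
    theta = theta / theta.norm(dim=0, keepdim=True)   # unit directions
    px = torch.sort(x @ theta, dim=0).values          # (n, n_proj)
    py = torch.sort(y @ theta, dim=0).values
    return ((px - py) ** 2).mean().sqrt()
```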
AI · Bullish · arXiv › CS AI · Mar 17 · 4/10
🎧 Researchers have developed LAMB, a new AI framework that improves automated audio captioning by better aligning audio features with large language models through Cauchy-Schwarz divergence optimization. The system achieved state-of-the-art performance on the AudioCaps dataset by bridging the modality gap between audio and text embeddings.
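The Cauchy-Schwarz divergence the summary mentions has a standard kernel estimator; here is a generic sketch of it (the Gaussian kernel and bandwidth are assumptions, not LAMB's configuration). It is zero exactly when the two embedding distributions coincide, which is what makes it usable as a modality-gap objective:

```python
import torch

def cs_divergence(x, y, sigma=1.0):
    """Kernel estimate of the Cauchy-Schwarz divergence between two
    sets of embeddings (Gaussian kernel, bandwidth `sigma`)."""
    def gram_mean(a, b):
        d2 = torch.cdist(a, b) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2)).mean()
    # D_CS = -log( <p,q>^2 / (<p,p> <q,q>) ); zero iff the densities match.
    return -torch.log(gram_mean(x, y) ** 2 / (gram_mean(x, x) * gram_mean(y, y)))
```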
AI · Bullish · arXiv › CS AI · Mar 5 · 4/10
🎧 Researchers introduced LadderSym, a new Transformer-based AI method for detecting music practice errors that significantly outperforms existing approaches. The system uses multimodal processing of audio and symbolic music scores, more than doubling accuracy for detecting missed notes and improving extra note detection by 14.4 points.
AI · Neutral · arXiv › CS AI · Mar 4 · 4/10
🎧 A research paper compares three sinusoidal models for speech and audio signal processing: the standard Sinusoidal Model (SM), the Exponentially Damped Sinusoidal Model (EDSM), and the extended adaptive Quasi-Harmonic Model (eaQHM). The study finds that eaQHM performs better for medium-to-large analysis windows while EDSM excels with smaller windows, suggesting future research should combine both approaches.
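For reference, the first two models differ only by a damping term; these are their standard textbook forms (notation mine), with eaQHM extending them by letting the amplitudes and phases vary over time:

```latex
% Standard Sinusoidal Model (SM): K constant-amplitude partials
x(t) \approx \sum_{k=1}^{K} A_k \cos\left( 2\pi f_k t + \phi_k \right)

% Exponentially Damped Sinusoidal Model (EDSM): each partial decays at rate \lambda_k \ge 0
x(t) \approx \sum_{k=1}^{K} A_k \, e^{-\lambda_k t} \cos\left( 2\pi f_k t + \phi_k \right)
```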
AI · Neutral · arXiv › CS AI · Mar 4 · 4/10
🎧 Researchers introduce Whisper-RIR-Mega, a new benchmark dataset for testing automatic speech recognition robustness in reverberant acoustic environments. The study evaluates five Whisper models and finds that reverberation consistently degrades performance across all model sizes, with word error rates increasing by 0.12 to 1.07 percentage points.
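As a concrete illustration of how such a benchmark is typically evaluated (not necessarily Whisper-RIR-Mega's exact protocol): convolve dry speech with a room impulse response, transcribe both versions, and compare word error rates, e.g. with scipy and jiwer:

```python
import numpy as np
from scipy.signal import fftconvolve
import jiwer

def reverberate(speech, rir):
    """Convolve dry speech with a room impulse response (RIR) and
    renormalize; the usual way reverberant evaluation sets are built."""
    wet = fftconvolve(speech, rir)[: len(speech)]
    return wet * (np.abs(speech).max() / (np.abs(wet).max() + 1e-8))

# Placeholder transcripts: `reference` is ground truth; the `hyp_*`
# strings stand in for model outputs on clean vs. reverberated audio.
reference = "the quick brown fox"
hyp_clean, hyp_reverb = "the quick brown fox", "the quick brown vox"
delta_wer = jiwer.wer(reference, hyp_reverb) - jiwer.wer(reference, hyp_clean)
```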
AI · Neutral · Hugging Face Blog · Feb 1 · 4/10
🎧 The article appears to discuss implementing automatic speech recognition for large audio files using the Wav2Vec2 model in the Hugging Face Transformers library. However, the article body is empty, preventing detailed analysis of the technical implementation or its implications.
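Even without the article body, the technique its title points at is documented in Transformers: the ASR pipeline splits long files into overlapping chunks and merges the decoded text. A minimal example (the checkpoint choice here is an assumption, not necessarily the article's):

```python
from transformers import pipeline

# Chunked long-form inference: the file is split into fixed windows with
# overlapping strides, and the overlaps are merged after CTC decoding so
# words at chunk boundaries are not cut in half.
asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",  # one common checkpoint
    chunk_length_s=30,        # seconds of audio per chunk
    stride_length_s=(5, 5),   # left/right overlap in seconds
)
print(asr("long_recording.wav")["text"])
```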