14 articles tagged with #speech-processing. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AIBullishMarkTechPost ยท Mar 267/10
๐ง Tencent AI Lab has open-sourced Covo-Audio, a 7B-parameter Large Audio Language Model that can process continuous audio inputs and generate audio outputs in real-time. The model unifies speech processing and language intelligence within a single end-to-end architecture designed for seamless cross-modal interaction.
AINeutralarXiv โ CS AI ยท Mar 177/10
๐ง Researchers demonstrate that current audio deepfake detection systems incorrectly classify legitimate speech processing technologies like voice conversion and restoration as fake audio. A new multi-class detection approach shows improved accuracy by distinguishing between authentic speech, benign modifications, and actual spoofing attempts.
AINeutralarXiv โ CS AI ยท 4d ago6/10
๐ง Researchers demonstrate that applying Bayesian inference to Spiking Neural Networks (SNNs) for speech processing smooths the irregular loss landscape caused by threshold-based spike generation. Testing on speech datasets shows improved performance metrics and more regular predictive landscapes compared to deterministic approaches.
AINeutralarXiv โ CS AI ยท Mar 176/10
๐ง Researchers introduce a structural taxonomy and unified evaluation framework for Audio Large Language Models (ALLMs) to assess fairness, safety, and security. The study reveals systematic differences in how ALLMs handle audio versus text inputs, with FSS behavior closely tied to acoustic information integration methods.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers developed training-free model steering techniques to improve reasoning in large audio-language models (LALMs) through chain-of-thought prompting. The approach achieved up to 4.4% accuracy gains and demonstrated cross-modal transfer where text-derived steering vectors can effectively guide speech-based reasoning.
AIBullisharXiv โ CS AI ยท Mar 116/10
๐ง Facebook Research introduces the Latent Speech-Text Transformer (LST), which aggregates speech tokens into higher-level patches to improve computational efficiency and cross-modal alignment. The model achieves up to +6.5% absolute gain on speech HellaSwag benchmarks while maintaining text performance and reducing inference costs for ASR and TTS tasks.
AINeutralarXiv โ CS AI ยท Mar 36/105
๐ง Researchers introduced Spoof-SUPERB, a new benchmark for evaluating self-supervised learning models' ability to detect audio deepfakes. The study tested 20 SSL models and found that large-scale discriminative models like XLS-R and WavLM Large consistently outperformed others, especially under acoustic degradations.
AIBullisharXiv โ CS AI ยท Feb 275/103
๐ง Researchers developed Lipi-Ghor-882, an 882-hour Bengali speech dataset, and demonstrated that targeted fine-tuning with synthetic acoustic degradation significantly improves automatic speech recognition for long-form Bengali audio. Their dual pipeline achieved a 0.019 Real-Time Factor, establishing new benchmarks for low-resource speech processing.
AIBullishHugging Face Blog ยท Apr 96/105
๐ง Hugging Face and Cloudflare have partnered to launch FastRTC, a solution designed to enable seamless real-time speech and video processing. This collaboration combines Hugging Face's AI models with Cloudflare's edge computing infrastructure to reduce latency in real-time communications.
AINeutralarXiv โ CS AI ยท 3d ago5/10
๐ง Researchers developed a lightweight machine learning system that detects voicemail greetings versus live human answers in real-time telephony audio with 96.1% accuracy using only temporal speech activity patterns. The system processes calls in 46ms on standard CPUs and has been validated across 77,000 production calls, achieving practical false positive and negative rates suitable for AI calling applications.
AINeutralarXiv โ CS AI ยท Mar 124/10
๐ง Researchers propose AMB-DSGDN, a new AI system for multimodal emotion recognition that uses adaptive modality balancing and differential graph attention mechanisms. The system addresses limitations in existing approaches by filtering noise and preventing dominant modalities from overwhelming the fusion process in text, speech, and visual data.
AINeutralarXiv โ CS AI ยท Mar 44/104
๐ง Researchers have developed TVF (Time-Varying Filtering), a lightweight 1 million parameter speech enhancement model that combines digital signal processing with deep learning for real-time speech denoising. The model uses a neural network to predict coefficients for a 35-band IIR filter cascade, offering interpretable processing while adapting dynamically to changing noise conditions.
AINeutralarXiv โ CS AI ยท Mar 34/103
๐ง CodecFlow is a new neural codec-based framework for speech bandwidth extension that efficiently reconstructs high-quality audio in compact latent space. The system uses conditional flow matching and residual vector quantization to improve speech clarity by restoring high-frequency content from low-bandwidth audio.
AINeutralarXiv โ CS AI ยท Feb 273/107
๐ง This linguistic research study analyzes how Vietnamese learners of Mandarin Chinese acquire prosodic patterns, finding that advanced learners achieve native-like quantity in speech boundaries but develop inverted structural mapping patterns. The study reveals a trade-off between maintaining fluent output and achieving accurate prosodic structure in second language acquisition.