#audio-generation News & Analysis

12 articles tagged with #audio-generation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

UniVoice: A Unified Model for Speech and Singing Voice Generation

UniVoice is a unified AI model that generates both speech and singing from text using conditional flow matching. It performs comparably to dedicated speech systems and outperforms existing unified models on singing synthesis. Its key idea is to factorize conditioning into content, melody, and timbre components, applying melody constraints only to singing so that speech prosody remains flexible.
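
The factorized conditioning described above can be illustrated with a minimal sketch. This is not UniVoice's code: the function name `build_conditioning`, the dictionary layout, and the choice to represent "no melody constraint" as `None` are all assumptions made for illustration.

```python
from typing import Optional, Sequence


def build_conditioning(
    content: Sequence[int],
    timbre: Sequence[float],
    melody: Optional[Sequence[float]] = None,
    task: str = "speech",
) -> dict:
    """Assemble three conditioning streams for one utterance (hypothetical layout)."""
    if task not in ("speech", "singing"):
        raise ValueError(f"unknown task: {task!r}")
    if task == "singing" and melody is None:
        raise ValueError("singing requires a melody contour")
    return {
        "content": list(content),
        "timbre": list(timbre),
        # Melody constrains only singing; for speech it is dropped,
        # so prosody is left for the model to decide.
        "melody": list(melody) if task == "singing" else None,
    }
```

Under these assumptions, a speech request ignores any melody it is given, while a singing request must supply one.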

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Low-Resource Guidance for Controllable Latent Audio Diffusion

Researchers have developed a new method called Latent-Control Heads (LatCHs) that enables efficient control of audio generation in diffusion models with significantly reduced computational costs. The approach operates directly in latent space, avoiding expensive decoder steps and requiring only 7M parameters and 4 hours of training while maintaining audio quality.

AI · Bullish · OpenAI News · Sep 30 · 7/10 · 7

Sora 2 System Card

OpenAI has released Sora 2, an advanced video and audio generation model that significantly improves upon its predecessor. The new model features enhanced physics accuracy, sharper realism, synchronized audio capabilities, better user control, and expanded stylistic options.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

Bagpiper-TTS is a universal speech synthesis system that uses natural language prompts to guide flexible speech generation, moving beyond rigid TTS frameworks. The model achieves competitive performance across multiple applications including multi-talker synthesis, singing voice synthesis, and intent-to-speech tasks, matching dedicated models while offering broader versatility.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Whisper-GPT -- Continuous Discrete Hybrid Representation Language Models For Speech And Music

Researchers introduce Whisper-GPT, a hybrid language model that combines continuous audio representations (spectrograms) with discrete acoustic tokens to improve speech and music generation. This approach addresses context length limitations in traditional token-based models while maintaining high-fidelity audio synthesis capabilities.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

Researchers introduce ImmersiveTTS, an AI model that generates natural speech integrated within environmental audio contexts using multimodal diffusion transformers and domain-specific representation alignment. The advancement addresses a key challenge in audio generation: seamlessly combining speech with background environmental sounds while maintaining acoustic quality and intelligibility.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 8

AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

Researchers introduce AG-REPA, a new method for improving audio generation models by strategically selecting which neural network layers to align with teacher models. The approach identifies that layers storing the most information aren't necessarily the most important for generation, leading to better performance in speech and audio synthesis.

AI · Neutral · OpenAI News · Jun 20 · 6/10 · 6

Consistency Models

Diffusion models have made significant breakthroughs in generating images, audio, and video content. However, these models face a key limitation in their reliance on iterative sampling processes, which results in slower generation speeds.

AI · Neutral · OpenAI News · Mar 29 · 6/10 · 3

Navigating the challenges and opportunities of synthetic voices

OpenAI shares insights from a limited preview of Voice Engine, their model for creating synthetic custom voices. The company is exploring the technology's potential while addressing associated challenges and risks.

AI · Neutral · OpenAI News · Apr 30 · 6/10 · 4

Jukebox

Jukebox is a new neural network that generates music and rudimentary singing as raw audio across various genres and artist styles. The developers are releasing the model weights, code, and exploration tools to the public.

AI · Neutral · arXiv – CS AI · Apr 6 · 4/10

Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS

Researchers developed a two-stage prompt selection strategy for zero-shot text-to-speech synthesis that improves emotional intensity and speaker consistency. The method evaluates prompts using prosodic features, audio quality, and text-emotion coherence in a static stage, then uses textual similarity for dynamic prompt selection during synthesis.
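
The two-stage selection summarized above can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's implementation: the `Prompt` fields, the unweighted mean used for static ranking, the `keep` cutoff, and the use of `difflib.SequenceMatcher` as the textual-similarity measure are all hypothetical stand-ins.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Prompt:
    transcript: str
    prosody: float    # hypothetical prosodic-feature score in [0, 1]
    quality: float    # hypothetical audio-quality score in [0, 1]
    coherence: float  # hypothetical text-emotion coherence score in [0, 1]


def static_stage(prompts, keep=2):
    """Rank prompts once, ahead of synthesis, by the mean of the three scores."""
    ranked = sorted(
        prompts,
        key=lambda p: (p.prosody + p.quality + p.coherence) / 3,
        reverse=True,
    )
    return ranked[:keep]


def dynamic_stage(shortlist, target_text):
    """At synthesis time, pick the shortlisted prompt most similar to the target text."""
    return max(
        shortlist,
        key=lambda p: SequenceMatcher(
            None, p.transcript.lower(), target_text.lower()
        ).ratio(),
    )


def select_prompt(prompts, target_text, keep=2):
    """Run the static filter, then the dynamic text-similarity choice."""
    return dynamic_stage(static_stage(prompts, keep), target_text)
```

In this sketch, a prompt whose transcript matches the target text exactly can still be rejected if it scored poorly in the static stage, which is the point of filtering before matching.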

AI · Neutral · Hugging Face Blog · Aug 30 · 3/10 · 7

AudioLDM 2, but faster ⚡️

The article announces a faster version of AudioLDM 2. The article body appears to be empty or incomplete, so no details on the technical improvements or their implications are available.