UniVoice: A Unified Model for Speech and Singing Voice Generation
UniVoice is a unified AI model that generates both speech and singing from text using conditional flow matching, achieving performance comparable to dedicated speech systems while outperforming existing unified models for singing synthesis. Its central idea is to factorize conditioning into content, melody, and timbre components, applying melody constraints only to singing while leaving speech prosody flexible.
UniVoice addresses a fundamental technical challenge in generative audio: creating a single model capable of handling two distinct vocal synthesis tasks with conflicting requirements. Traditional approaches either build separate models for speech and singing or compromise performance by forcing melody constraints on both tasks. This research demonstrates that architectural innovation—factorizing conditioning through modality-appropriate encoders and a shared DiT backbone—enables unified generation without sacrificing quality.
The model's design elegantly solves the melody constraint problem through a learned null melody token for speech, which functions as an approximation to melody marginalization in conditional flow matching. This mathematical insight allows the same architecture to handle MIDI-controlled melody for singing while preserving natural prosody inference for speech from linguistic context.
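The null melody token idea can be illustrated with a small sketch. This is not the authors' implementation: the embedding table, its sizes, the function names, and the linear interpolation path used for the flow-matching target are all assumptions chosen for illustration. The sketch shows singing frames looking up a MIDI pitch embedding while every speech frame maps to one shared null row, so the backbone receives no pitch constraint for speech.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MEL = 8      # hypothetical melody-embedding width
N_PITCH = 128  # MIDI pitches 0..127

# Hypothetical embedding table: one row per MIDI pitch plus one extra row
# reserved for the "null melody" token (in a real model this row is learned).
melody_table = rng.normal(size=(N_PITCH + 1, D_MEL))
NULL_MELODY = N_PITCH  # index of the null-melody row


def melody_condition(midi_pitches, is_singing):
    """Return per-frame melody embeddings of shape (frames, D_MEL).

    Singing frames look up their MIDI pitch. Speech frames all map to the
    single null-melody row, leaving prosody to be inferred from the text.
    """
    midi_pitches = np.asarray(midi_pitches)
    if is_singing:
        idx = midi_pitches
    else:
        idx = np.full(midi_pitches.shape, NULL_MELODY)
    return melody_table[idx]


def cfm_training_pair(x1, rng):
    """Build one flow-matching training example on a linear path (assumed).

    x_t = (1 - t) * x0 + t * x1, with target velocity x1 - x0, where x0 is
    Gaussian noise and x1 is the clean acoustic feature sequence.
    """
    x0 = rng.normal(size=x1.shape)
    t = rng.uniform()
    x_t = (1.0 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return t, x_t, target_velocity


# Usage: the same code path serves both tasks; only the melody input differs.
singing_cond = melody_condition([60, 62, 64], is_singing=True)
speech_cond = melody_condition([60, 62, 64], is_singing=False)
t, x_t, v = cfm_training_pair(np.ones((3, 4)), rng)
```

In a full system, the melody embeddings would be combined with content and timbre conditioning and passed to the shared backbone, which is trained to predict `target_velocity` from `x_t`, `t`, and the conditions.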
Trained on 65,000 hours of combined speech and singing data, UniVoice achieves a 5.26% phoneme error rate on speech, matching F5-TTS and CosyVoice3, and 16.22% on singing, well below Vevo1.5's 24.72%. These metrics indicate that the approach does not sacrifice performance in either domain to achieve unification.
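For readers unfamiliar with the metric, phoneme error rate is conventionally the edit distance between the reference and recognized phoneme sequences, divided by the reference length. The sketch below assumes that standard definition; the evaluation pipeline actually used for UniVoice (phoneme recognizer, normalization, phoneme set) is not described here, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the number of reference phonemes."""
    if not ref:
        raise ValueError("reference phoneme sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Usage: one phoneme dropped out of four reference phonemes.
per = phoneme_error_rate(["HH", "AH", "L", "OW"], ["HH", "AH", "OW"])
```

Lower is better, so a drop from 24.72% to 16.22% on singing means fewer phoneme-level mistakes in the generated audio as judged by the recognizer.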
The implications extend beyond academic interest. A genuinely unified vocal synthesis system reduces computational overhead, simplifies deployment pipelines, and enables potential future applications combining speech and singing in coherent outputs. The research validates conditional flow matching as a superior alternative to diffusion-based approaches for multi-modal generation. However, generalization to underrepresented languages and real-time inference capabilities remain open questions for practical adoption.
- UniVoice achieves parity with dedicated speech synthesis models while substantially outperforming previous unified speech-singing systems.
- The null melody token innovation enables melody constraints for singing without restricting natural speech prosody inference.
- Factorized conditioning through modality-appropriate encoders provides a scalable architecture for multi-task vocal synthesis.
- The 65,000-hour training dataset demonstrates feasibility of large-scale unified vocal synthesis training.
- Conditional flow matching outperforms diffusion-based baselines for this multi-modal generation challenge.