🧠 AI | 🟢 Bullish | Importance 7/10

End-to-End Training for Discrete Token LLM based TTS System

arXiv – CS AI | Changfeng Gao, Yong Ren, Jun Yuan, Ye Bai, Zhao You, ShiDong Shang | June 9, 2026 at 04:00 AM

🤖AI Summary

Researchers propose a fully end-to-end training framework that jointly optimizes all components of a discrete-token-based text-to-speech system (speech tokenizer, language model, diffusion model, and reward model) rather than training them independently. The approach reports a state-of-the-art 0.78% word error rate on Seed-TTS-Eval using a 0.6B-parameter LLM and a 0.5B-parameter FM model.

Analysis

This research addresses a fundamental inefficiency in modern TTS architecture by eliminating the training disconnect between pipeline components. Traditional cascaded systems optimize each module separately, creating inference-time mismatches where downstream models receive distributions of tokens they were never trained on. The proposed end-to-end framework resolves this through multi-task learning objectives that align the speech tokenizer with the needs of both language modeling and audio reconstruction, while simultaneously tuning the LLM using feedback from downstream components.
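The multi-task idea in this paragraph can be sketched as a weighted sum of per-task losses. This is only an illustration of the general technique, not the authors' formulation: the class name `TokenizerLosses`, the function `joint_tokenizer_loss`, and the weights are hypothetical, and the summary does not state which loss terms or weightings the paper actually uses.

```python
from dataclasses import dataclass


@dataclass
class TokenizerLosses:
    """Hypothetical per-batch loss values for a speech tokenizer."""
    reconstruction: float     # how well audio is rebuilt from the tokens
    language_modeling: float  # how predictable the tokens are for the LLM


def joint_tokenizer_loss(
    losses: TokenizerLosses,
    recon_weight: float = 1.0,  # assumed weight, not taken from the paper
    lm_weight: float = 0.5,     # assumed weight, not taken from the paper
) -> float:
    """Combine both objectives so the tokenizer is pulled toward both tasks."""
    if recon_weight < 0 or lm_weight < 0:
        raise ValueError("loss weights must be non-negative")
    return (
        recon_weight * losses.reconstruction
        + lm_weight * losses.language_modeling
    )


example = TokenizerLosses(reconstruction=2.0, language_modeling=4.0)
total = joint_tokenizer_loss(example)
print(total)  # 1.0 * 2.0 + 0.5 * 4.0 = 4.0
```

Setting `lm_weight` to zero recovers a tokenizer trained for reconstruction alone, which is the independently trained baseline the paragraph contrasts against.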

The approach reflects a broader trend in machine learning toward holistic system optimization rather than modular composition. Similar principles have driven improvements in other domains, from vision-language models to reinforcement learning systems. By unifying training, the researchers show that discrete token spaces can capture the specific acoustic and semantic information needed for high-quality synthesis, not just general speech characteristics.

For the AI development community, these results suggest that simpler training pipelines can achieve superior performance when their components are properly integrated. The reported 0.78% word error rate on Seed-TTS-Eval, obtained with a 0.6B-parameter LLM, is particularly significant: it indicates that efficiency and quality need not be traded against each other. This has implications for deployment at scale and could influence how commercial TTS providers architect their systems.
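For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a recognized transcript, divided by the number of reference words. The sketch below shows that standard definition in plain Python. The function name `word_error_rate` is illustrative, and the text normalization and speech recognizer used by the benchmark are not described in this summary, so this is not a reproduction of its scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word is missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
score = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{score:.4f}")  # 0.1667
```

On this definition, a 0.78% WER corresponds to roughly 78 word errors per 10,000 reference words.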

Looking forward, this work suggests that end-to-end optimization could become standard practice in sequential generation tasks. The framework's success with smaller models also indicates that future TTS systems may require fewer computational resources while maintaining quality, potentially making high-quality speech synthesis accessible to smaller organizations and applications.

Key Takeaways

→ End-to-end joint training of all TTS components outperforms independent cascaded training across multiple metrics
→ Achieves a state-of-the-art 0.78% WER on Seed-TTS-Eval using only a 0.6B-parameter LLM and a 0.5B-parameter FM model
→ Multi-task optimization during tokenizer training creates discrete speech spaces better aligned with downstream task requirements
→ Eliminates inference-time distribution mismatch by training the LLM with reconstruction and recognition feedback from downstream models
→ Demonstrates that simpler, more integrated training pipelines can deliver superior performance compared to complex modular approaches