AI · Neutral · arXiv – CS AI · 3h ago · 6/10
Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts
Researchers introduce PlanAudio, an LLM-based framework that generates speech, sound, and composites of the two as unified audio directly from free-form text prompts. The approach uses a semantic latent chain-of-thought mechanism to bridge language understanding and acoustic synthesis, and it outperforms existing pipeline-based and baseline models across multiple audio scenarios.