#tts News & Analysis

14 articles tagged with #tts. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · May 27 · 7/10

PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

PilotTTS demonstrates that competitive text-to-speech systems no longer require massive proprietary datasets or complex architectures. Using only 200K hours of openly-processed data and a lightweight autoregressive model, the system achieves industry-leading performance on benchmark tests while supporting voice cloning, emotion synthesis, and multilingual capabilities.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead

Researchers introduce S5-TTS, a streaming variant of T5-based text-to-speech that generates speech word-by-word with minimal latency by processing limited lookahead context. The system uses novel masking mechanisms and distillation techniques to maintain speech quality and speaker similarity while enabling real-time conversational AI applications.

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Bagpiper-TTS: Natural Language Guided Universal Speech Synthesis

Bagpiper-TTS is a universal speech synthesis system that uses natural language prompts to guide flexible speech generation, moving beyond rigid TTS frameworks. The model achieves competitive performance across multiple applications including multi-talker synthesis, singing voice synthesis, and intent-to-speech tasks, matching dedicated models while offering broader versatility.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

Researchers introduce GibbsTTS, a new zero-shot text-to-speech system using metric-induced discrete flow matching with kinetic-optimal scheduling and moment correction. The method achieves superior naturalness and speaker similarity compared to existing masked generative models and state-of-the-art TTS systems without requiring hyperparameter tuning.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

Voxtral TTS

Voxtral TTS is a new multilingual text-to-speech AI model that can generate natural speech from just 3 seconds of reference audio. In human evaluations, it achieved a 68.4% win rate over ElevenLabs Flash v2.5 for voice cloning, demonstrating superior naturalness and expressivity.

AI · Bullish · MarkTechPost · Mar 17 · 6/10

Google AI Releases WAXAL: A Multilingual African Speech Dataset for Training Automatic Speech Recognition and Text-to-Speech Models

Google AI has released WAXAL, an open multilingual speech dataset covering 24 African languages to improve Automatic Speech Recognition and Text-to-Speech systems. This addresses the significant data distribution problem where African languages remain poorly represented in speech technology training corpora.

🏢 Google

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

SyncSpeech: Efficient and Low-Latency Text-to-Speech based on Temporal Masked Transformer

Researchers introduce SyncSpeech, a new text-to-speech model that combines autoregressive and non-autoregressive approaches using a Temporal Masked Transformer architecture. The model achieves 5.8x lower first-packet latency and 8.8x better real-time performance while keeping speech quality comparable to existing models.

AI · Neutral · arXiv – CS AI · Mar 12 · 6/10

Probabilistic Verification of Voice Anti-Spoofing Models

Researchers have developed PV-VASM, a probabilistic framework for verifying the robustness of voice anti-spoofing models against deepfake attacks. The model-agnostic approach estimates misclassification probability under various speech synthesis techniques including text-to-speech and voice cloning, providing formal robustness guarantees against unseen generation methods.

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10

When Fine-Tuning Fails and when it Generalises: Role of Data Diversity and Mixed Training in LLM-based TTS

Research demonstrates that LoRA fine-tuning of large language models significantly improves text-to-speech systems, achieving up to 0.42 DNS-MOS gains and 34% SNR improvements when training data has sufficient acoustic diversity. The study establishes LoRA as an effective mechanism for speaker adaptation in compact LLM-based TTS systems, outperforming frozen base models across perceptual quality, speaker fidelity, and signal quality metrics.

AI · Bullish · MarkTechPost · Mar 11 · 6/10

Fish Audio Releases Fish Audio S2: A New Generation of Expressive Text-to-Speech (TTS) with Absurdly Controllable Emotion

Fish Audio has released S2-Pro, a flagship Large Audio Model (LAM) that enables high-fidelity, multi-speaker text-to-speech synthesis with sub-150ms latency. The system features zero-shot voice cloning capabilities and granular emotion control, representing a shift from traditional modular TTS pipelines to integrated audio models.

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

DuplexCascade: Full-Duplex Speech-to-Speech Dialogue with VAD-Free Cascaded ASR-LLM-TTS Pipeline and Micro-Turn Optimization

DuplexCascade introduces a VAD-free cascaded streaming pipeline that enables full-duplex speech-to-speech dialogue while maintaining LLM intelligence. The system converts traditional long utterance turns into micro-turn interactions using special control tokens to coordinate turn-taking and response timing.

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

Latent Speech-Text Transformer

Facebook Research introduces the Latent Speech-Text Transformer (LST), which aggregates speech tokens into higher-level patches to improve computational efficiency and cross-modal alignment. The model achieves up to +6.5% absolute gain on speech HellaSwag benchmarks while maintaining text performance and reducing inference costs for ASR and TTS tasks.

AI · Neutral · arXiv – CS AI · Mar 17 · 4/10

NV-Bench: Benchmark of Nonverbal Vocalization Synthesis for Expressive Text-to-Speech Generation

Researchers introduce NV-Bench, the first standardized benchmark for evaluating nonverbal vocalizations in text-to-speech systems. The benchmark includes 1,651 multilingual utterances across 14 categories and proposes new evaluation metrics that show strong correlation with human perception.

AI · Neutral · Hugging Face Blog · Feb 27 · 5/10

TTS Arena: Benchmarking Text-to-Speech Models in the Wild

TTS Arena introduces a new benchmarking platform for evaluating text-to-speech models through community-driven comparisons in real-world scenarios. The platform aims to provide standardized evaluation metrics for TTS quality assessment across different models and use cases.