ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Researchers have released ParsVoice, a 2,200-hour Persian speech dataset with 1.36 million aligned segments from 1,815 speakers, making it 25 times larger than previous Persian TTS resources. The dataset was constructed using an automated pipeline combining ASR, fine-tuned language models, and quality assessment, and validation shows the corpus enables multi-speaker text-to-speech systems competitive with existing solutions.
ParsVoice addresses a significant gap in open-source speech resources for Persian, a language with over 70 million speakers but minimal representation in public datasets. The research team developed a scalable pipeline that transforms raw audiobook recordings into high-quality training data through automated sentence boundary detection, punctuation restoration, and speaker identification, reducing manual annotation costs while maintaining quality standards. The approach shows how a structured, automated workflow can extract usable datasets from existing media, a pattern that grows more relevant as raw content accumulates faster than labeled data.
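The final filtering stage of such a pipeline can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Segment` fields, the `quality` score, and the thresholds are hypothetical and are not taken from the ParsVoice release. It only shows the general idea of gating candidate clips on a quality score and duration, then totaling retained audio per speaker.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    speaker_id: str    # label from a speaker-identification step (hypothetical)
    text: str          # punctuated transcript for the clip
    duration_s: float  # clip length in seconds
    quality: float     # hypothetical quality score in [0, 1]


def keep(seg, min_quality=0.8, min_s=1.0, max_s=30.0):
    """Return True if the segment passes the illustrative quality gate.

    The thresholds are placeholders, not values reported for ParsVoice.
    """
    return seg.quality >= min_quality and min_s <= seg.duration_s <= max_s


def hours_per_speaker(segments):
    """Sum audio duration per speaker, in hours."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.speaker_id] += seg.duration_s / 3600.0
    return dict(totals)


candidates = [
    Segment("spk_a", "سلام.", 6.0, 0.95),  # kept
    Segment("spk_a", "سلام.", 0.4, 0.99),  # dropped: too short
    Segment("spk_b", "سلام.", 7.5, 0.55),  # dropped: low quality score
    Segment("spk_b", "سلام.", 5.0, 0.90),  # kept
]

retained = [seg for seg in candidates if keep(seg)]
totals = hours_per_speaker(retained)
```

In a real pipeline the quality score would come from a dedicated assessment model, and the thresholds would be tuned against held-out listening tests rather than fixed by hand.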
The dataset's scale and quality reflect broader trends in democratizing AI capabilities across languages. Previously, Persian TTS research relied on proprietary datasets or significantly smaller public resources, creating a competitive disadvantage for researchers and developers in Persian-speaking regions. The release of 2,200 hours of TTS-ready data fundamentally changes this dynamic, enabling local innovation ecosystems to build sophisticated voice synthesis products comparable to English or Mandarin alternatives.
For the AI community, ParsVoice validates that zero-shot multilingual models like XTTS can achieve respectable results (3.6/5 naturalness MOS) without language-specific phoneme engineering, suggesting transfer learning approaches are viable for underrepresented languages. This has implications for scaling TTS to other low-resource languages cost-effectively. Developers targeting Persian markets gain immediate access to production-quality training data, while researchers can now explore linguistic phenomena specific to Persian speech in ways previously impossible.
- With 2,200 hours and 1.36 million segments, ParsVoice is 25 times larger than the previous largest open Persian TTS dataset.
- An automated pipeline combining ASR, BERT classifiers, and quality assessment reduced manual annotation bottlenecks while maintaining data quality.
- Zero-shot multilingual TTS models achieve competitive results on Persian without language-specific phoneme representations.
- The dataset enables local development of voice synthesis applications for a language with more than 70 million speakers that was previously underserved by open resources.
- Scalable dataset construction from audiobooks demonstrates a replicable model for expanding AI training data in other low-resource languages.
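
To put the headline figures in perspective, the short calculation below derives a few averages from the numbers quoted in this article. The inputs are rounded, so the results are approximate, and the implied size of the earlier dataset assumes that "25 times larger" refers to hours of audio, which the article does not state explicitly.

```python
# Figures as quoted in the article (rounded).
TOTAL_HOURS = 2200
SEGMENTS = 1_360_000
SPEAKERS = 1815
SCALE_FACTOR = 25

# Average clip length in seconds: roughly 5.8 s per segment.
avg_segment_s = TOTAL_HOURS * 3600 / SEGMENTS

# Average audio per speaker in hours: roughly 1.2 h.
# The real distribution across speakers is likely uneven.
avg_hours_per_speaker = TOTAL_HOURS / SPEAKERS

# Implied size of the previous largest open dataset: 88 h,
# assuming the 25x comparison is measured in hours.
implied_prior_hours = TOTAL_HOURS / SCALE_FACTOR

print(f"{avg_segment_s:.1f} s per segment")
print(f"{avg_hours_per_speaker:.2f} h per speaker")
print(f"{implied_prior_hours:.0f} h in the implied prior dataset")
```

Segments averaging under six seconds are consistent with sentence-level clips, the granularity at which TTS models are commonly trained.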