GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model
GenTSE introduces a two-stage generative language model for target speaker extraction that separates semantic and acoustic token generation, demonstrating improved speech quality and speaker consistency over previous LM-based approaches. The system employs novel training strategies including Frozen-LM Conditioning and Direct Preference Optimization to reduce exposure bias and align outputs with human perceptual preferences.
GenTSE applies generative language modeling to the target speaker extraction (TSE) problem. Its two-stage architecture decouples semantic understanding from acoustic reconstruction, so each stage can be optimized independently while the full pipeline stays coherent. This separation addresses a fundamental challenge in generative speech modeling: balancing semantic fidelity with acoustic quality without compounding the errors of a single end-to-end process.
The research builds on the broader trend of applying large language models to speech tasks, moving from traditional discriminative approaches toward generative modeling. Previous TSE systems either struggled with fidelity or relied on discretized tokens that lose information. GenTSE's use of continuous embeddings from SSL or codec models preserves richer context, enabling more nuanced speech reconstruction. The Frozen-LM Conditioning strategy targets a known problem in autoregressive models: exposure bias, the mismatch that arises because the model conditions on ground-truth tokens during training but on its own predictions during inference.
For the AI and speech processing industry, this work suggests that LM-based generative approaches can improve on prior LM-based methods in speech quality, intelligibility, and speaker consistency. Commercial applications in speech enhancement, hearing aids, conference call clarity, and voice assistant robustness could benefit from more accurate speaker separation. The use of Direct Preference Optimization shows that alignment techniques from LLM development can transfer to speech tasks, which suggests that similar cross-domain applications may yield further improvements.
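To make the Direct Preference Optimization step more concrete, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not GenTSE's training code: the function name, the `beta` value, and the assumption that each argument is the summed log-probability of a whole token sequence (for example, a preferred and a less-preferred extraction of the same mixture) are illustrative choices, and how GenTSE builds its preference pairs is not described here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Each argument is assumed to be the total log-probability that a model
    assigns to a token sequence. `beta` (hypothetical value) controls how
    strongly the policy is pushed away from the frozen reference model.
    """
    # Implicit reward margin: how much more the policy prefers the chosen
    # sequence over the rejected one, relative to the reference model.
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))

    # loss = -log(sigmoid(beta * margin)) = softplus(-beta * margin),
    # written in a numerically stable form.
    x = -beta * margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


if __name__ == "__main__":
    # Policy identical to the reference: no preference signal yet.
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy already favors the chosen sequence: lower loss.
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

When the policy matches the reference, the margin is zero and the loss equals log 2. The loss falls as the policy assigns relatively more probability to the preferred sequence, which is the mechanism by which outputs are steered toward preferred speech quality.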
Future developments should focus on computational efficiency for real-time deployment and generalization to diverse acoustic environments and speaker populations beyond the Libri2Mix benchmark.
- Two-stage generative architecture separates semantic and acoustic token generation for improved stability and accuracy in speech extraction
- Continuous embedding representations preserve richer context than discretized tokens used in previous LM-based TSE systems
- Frozen-LM Conditioning strategy effectively reduces exposure bias between training and autoregressive inference
- Direct Preference Optimization successfully aligns model outputs with human perceptual preferences in speech quality
- System surpasses prior LM-based approaches on speech quality, intelligibility, and speaker consistency metrics