AI · Neutral · arXiv – CS AI · 18h ago · 6/10
GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model
GenTSE introduces a two-stage generative language model for target speaker extraction that separates coarse semantic token generation from fine acoustic token generation, improving speech quality and speaker consistency over previous LM-based approaches. Its training strategies include Frozen-LM Conditioning, which reduces exposure bias, and Direct Preference Optimization, which aligns outputs with human perceptual preferences.
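For readers unfamiliar with Direct Preference Optimization, a minimal sketch of the standard DPO loss for a single preference pair may help. This is the generic formulation, not GenTSE's training code: the function names, the `beta` value, and the example log-probabilities are illustrative assumptions, and the summary does not say how GenTSE builds its preferred and rejected outputs.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy and a
    frozen reference model. beta=0.1 is an assumed, illustrative value.
    """
    # Implicit reward of each output: log-ratio of policy to reference.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)


# Hypothetical log-probabilities: the policy favors the chosen output more
# than the reference does, so the loss falls below log(2).
loss = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

When the policy matches the reference exactly, the loss equals log 2. It decreases as the policy shifts probability toward the preferred output relative to the reference.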