Targeted Speaker Poisoning Framework in Zero-Shot Text-to-Speech

arXiv – CS AI | Thanapat Trachu, Thanathai Lertpetchpun, Sai Praneeth Karimireddy, Shrikanth Narayanan | June 1, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce Speech Generation Speaker Poisoning (SGSP), a framework for removing specific speaker identities from zero-shot text-to-speech models while maintaining utility for other speakers. The study evaluates privacy-utility trade-offs and identifies scalability limits once the forget set grows beyond 15 speakers, with substantial degradation at 100, highlighting emerging challenges in generative voice privacy.

Analysis

Zero-shot text-to-speech (TTS) technology enables rapid voice cloning from minimal audio samples, which creates a significant privacy vulnerability: individuals can have their voices synthesized without consent. This research formalizes the speaker poisoning problem, that is, how to surgically remove specific identities from trained models, and addresses a gap that conventional machine unlearning cannot adequately close. Conventional unlearning targets what a model has learned from its training data, but a zero-shot TTS system can reconstruct a voice dynamically from a short reference sample at inference time, so novel defensive strategies are required.

The work evaluates the trade-off between utility, measured by how well word error rate (WER) is preserved, and privacy, measured by speaker-similarity metrics, and demonstrates effective forgetting for up to 15 identities. However, performance degrades substantially at 100 speakers because of increased identity overlap in the learned representations, which points to architectural constraints. This scalability wall suggests that current neural network designs may need architectural innovations to support large-scale speaker deletion without catastrophic utility loss.
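
The two sides of this trade-off can be illustrated with a minimal sketch. It is not the paper's evaluation code: the summary does not specify how WER or speaker similarity are computed, so the sketch assumes the standard word-level edit-distance definition of WER and assumes cosine similarity between speaker embeddings as the similarity score. The function names `word_error_rate` and `cosine_similarity` are hypothetical, and the embeddings are plain lists of floats standing in for the output of some speaker encoder.

```python
from math import sqrt


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors.

    Assumption: speaker similarity is scored this way; the summary
    does not say which similarity measure the paper uses.
    """
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
```

Under these assumptions, a successful forgetting method would keep `word_error_rate` low on transcripts of synthesized speech (utility) while pushing `cosine_similarity` between the generated audio's embedding and the forgotten speaker's embedding down (privacy).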

For the AI and voice technology industry, this research signals growing regulatory and ethical pressure around voice synthesis models. As generative audio becomes mainstream in virtual assistants, audiobooks, and deepfake applications, stakeholders face demands for privacy-preserving mechanisms. The framework gives both developers and regulators a structured way to audit whether models can honor speaker deletion requests, potentially informing future compliance standards similar to the GDPR's right to be forgotten.

Future work will likely focus on improving the 100-speaker scenario through better model architectures, federated approaches, or hybrid inference-time filtering strategies. The research underscores that privacy in generative models requires fundamental rethinking beyond post-hoc mitigation, positioning this as foundational work for responsible voice AI development.

Key Takeaways

→Zero-shot TTS voice cloning demands novel privacy protections beyond conventional machine unlearning due to dynamic reconstruction capabilities.
→Speech Generation Speaker Poisoning achieves strong privacy for up to 15 forgotten speakers while maintaining model utility for other voices.
→Scalability degrades significantly at 100 speakers due to identity overlap in neural representations, indicating architectural limitations.
→The framework establishes quantifiable metrics (AUC, FSSIM) for measuring privacy-utility trade-offs in voice synthesis models.
→Results suggest future regulatory compliance standards for voice models may require certified speaker deletion mechanisms.
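
The AUC mentioned in the takeaways can be sketched generically. The summary does not say which two groups of scores the paper compares, nor how FSSIM is defined, so the code below only shows the standard pairwise definition of ROC AUC: the probability that a randomly chosen score from one group exceeds a randomly chosen score from the other, with ties counted as one half. The function name `roc_auc` and the example score lists are hypothetical.

```python
def roc_auc(positive_scores, negative_scores) -> float:
    """ROC AUC via pairwise comparison (ties count as 0.5).

    Assumption: the scores are speaker-similarity values for two groups
    of utterances; which groups the paper actually compares is not
    stated in the summary.
    """
    if not positive_scores or not negative_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical similarity scores, for illustration only.
group_a = [0.91, 0.85, 0.78]
group_b = [0.40, 0.35, 0.80]
print(f"AUC = {roc_auc(group_a, group_b):.3f}")
```

An AUC of 1.0 means the two groups are perfectly separable by the score, while 0.5 means the score carries no information about group membership. The pairwise loop is quadratic in the number of scores, which is acceptable for a sketch but would be replaced by a rank-based computation for large evaluations.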