AI · Neutral · arXiv – CS AI · 6h ago · 6/10
Not All Synthetic Data Is Yours to Learn From
A new study finds that language models can improve by learning from their own generated text, but only when the synthetic data is compatible with the student model's existing capabilities. The research shows that the utility of synthetic data is relational rather than intrinsic: it depends on the match between the data and the model learning from it, not on the data alone. Surprisingly, this self-training approach can also reduce verbatim memorization by 95% without any explicit unlearning objective.