AI · Neutral · arXiv – CS AI · 7h ago · 6/10
Rethinking the Role of Temperature in Large Language Model Distillation
Researchers demonstrate that temperature scaling changes the relative performance of forward KL and reverse KL divergence in LLM distillation: at higher temperatures, forward KL substantially outperforms reverse KL by making better use of the signal from non-dominant tokens. The finding challenges the prevailing preference for reverse KL and suggests that, with a well-chosen temperature, simple KL-based methods can match state-of-the-art distillation approaches.
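
For readers unfamiliar with the terms, here is a minimal sketch of what temperature scaling does to the two divergences. It is not the paper's implementation. It assumes the common convention that forward KL is KL(teacher || student) and reverse KL is KL(student || teacher), with both distributions computed from logits divided by the same temperature T. The logits and the helper names (`softmax`, `kl`, `distill_losses`) are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax: divide logits by T, then normalize."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distill_losses(teacher_logits, student_logits, temperature):
    """Return (forward KL, reverse KL) at the given temperature."""
    p = softmax(teacher_logits, temperature)  # teacher distribution
    q = softmax(student_logits, temperature)  # student distribution
    forward = kl(p, q)  # assumed convention: KL(teacher || student)
    reverse = kl(q, p)  # assumed convention: KL(student || teacher)
    return forward, reverse

# Hypothetical logits over a 4-token vocabulary (illustrative only).
teacher_logits = [4.0, 2.0, 1.0, 0.5]
student_logits = [3.0, 0.5, 2.0, 1.0]

for T in (1.0, 2.0, 4.0):
    p = softmax(teacher_logits, T)
    fwd, rev = distill_losses(teacher_logits, student_logits, T)
    non_dominant_mass = 1.0 - max(p)  # probability outside the top token
    print(f"T={T}: non-dominant teacher mass={non_dominant_mass:.3f}, "
          f"forward KL={fwd:.4f}, reverse KL={rev:.4f}")
```

Raising T flattens the teacher distribution, so more probability mass sits on non-dominant tokens. Forward KL weights each token's log-ratio by the teacher's probability, so those tokens contribute more to the loss as T grows. Whether that translates into better distilled models is the paper's empirical claim, which this toy example does not test.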