Rethinking the Role of Temperature in Large Language Model Distillation
Researchers demonstrate that temperature scaling fundamentally alters the performance comparison between forward KL and reverse KL divergence in LLM distillation, revealing that forward KL substantially outperforms reverse KL at higher temperatures by better leveraging non-dominant token signals. This finding challenges the prevailing preference for reverse KL and suggests that temperature optimization enables simple KL-based methods to match state-of-the-art distillation approaches.
The paper addresses a critical gap in large language model distillation research by reconsidering the role of temperature, a hyperparameter often overlooked in comparative analyses. Previous work consistently favored reverse KL divergence over forward KL, yet these comparisons were conducted primarily at temperature τ=1, ignoring how temperature affects each method's learning dynamics. The authors demonstrate an asymmetric mechanism: temperature enriches forward KL by amplifying signals from non-dominant tokens in the teacher distribution, while it merely rescales gradients in reverse KL. This distinction proves significant, as forward KL is substantially more sensitive to the choice of temperature and gains more from raising it.
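For reference, the two objectives being compared can be written in their standard textbook form. This is a sketch under assumptions, not the paper's exact formulation: the temperature is written as τ, the symbols z^T and z^S for teacher and student logits are illustrative, and the same temperature is assumed to be applied to both models.

```latex
% Temperature-softened teacher (p) and student (q) distributions over tokens y.
% z^T, z^S: teacher and student logits (illustrative notation, not from the paper).
p_\tau(y \mid x) = \frac{\exp\!\left(z^{T}_{y}/\tau\right)}{\sum_{y'} \exp\!\left(z^{T}_{y'}/\tau\right)},
\qquad
q_\tau(y \mid x) = \frac{\exp\!\left(z^{S}_{y}/\tau\right)}{\sum_{y'} \exp\!\left(z^{S}_{y'}/\tau\right)}

% Forward KL: expectation taken under the teacher distribution.
\mathcal{L}_{\mathrm{FKL}} = \sum_{y} p_\tau(y \mid x)\,\log\frac{p_\tau(y \mid x)}{q_\tau(y \mid x)}

% Reverse KL: expectation taken under the student distribution.
\mathcal{L}_{\mathrm{RKL}} = \sum_{y} q_\tau(y \mid x)\,\log\frac{q_\tau(y \mid x)}{p_\tau(y \mid x)}
```

In forward KL each token's term is weighted by the teacher probability, so raising τ increases the weight given to non-dominant tokens; this is consistent with the asymmetry described above.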
The research emerges from a broader trend in machine learning where hyperparameter interactions shape algorithm performance more profoundly than previously appreciated. Temperature scaling in knowledge distillation controls the softness of probability distributions; softer distributions expose richer learning signals. The paper's empirical findings overturn conventional wisdom by showing forward KL surpasses reverse KL at optimal temperatures across multiple instruction-following benchmarks.
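The softening effect can be illustrated with a small numerical sketch. The logits below are made-up values, the helper names (`softmax`, `kl`, `non_dominant_mass`) are hypothetical, and applying the same temperature to teacher and student is an assumption; this is not the paper's implementation.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature gives a softer distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def non_dominant_mass(probs):
    """Probability mass assigned to every token except the most likely one."""
    return 1.0 - max(probs)

# Illustrative logits over a 4-token vocabulary (assumed values, not real data).
teacher_logits = [6.0, 2.0, 1.0, 0.0]
student_logits = [5.0, 0.0, 2.0, 1.0]

for tau in (1.0, 2.0, 4.0):
    p = softmax(teacher_logits, tau)  # teacher distribution
    q = softmax(student_logits, tau)  # student distribution
    print(
        f"tau={tau}: non-dominant teacher mass={non_dominant_mass(p):.3f}, "
        f"forward KL={kl(p, q):.4f}, reverse KL={kl(q, p):.4f}"
    )
```

With these toy logits, the teacher's non-dominant mass grows as the temperature rises, which is the sense in which softer distributions expose more of the teacher's signal. The KL values printed here describe only this toy example and say nothing about downstream benchmark performance.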
For practitioners developing efficient language models, this work offers immediate practical value. Organizations pursuing model distillation can now leverage simpler, more interpretable KL-based objectives without sacrificing performance against complex state-of-the-art methods. The discovery that temperature enables competitive performance from basic approaches reduces computational overhead in model optimization pipelines. Researchers should reassess existing distillation studies that omitted systematic temperature analysis, potentially revealing overlooked performance opportunities. The findings suggest that careful hyperparameter tuning may eliminate perceived advantages of more sophisticated divergence measures.
- Temperature scaling affects forward KL and reverse KL asymmetrically, with forward KL gaining substantially more benefit from higher temperatures
- Forward KL outperforms reverse KL at optimal temperatures, even though prior literature, which compared the two mainly at τ=1, favored reverse KL
- Temperature enriches forward KL by amplifying non-dominant token signals while mainly rescaling reverse KL gradients
- Simple KL-based distillation methods achieve competitive performance with state-of-the-art approaches when temperature is properly optimized
- Temperature's role in distillation has been systematically underestimated in prior comparative analyses