Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models
Researchers propose EKSFT, a fine-tuning method that selectively masks high-entropy and high-KL-divergence tokens during supervised fine-tuning (SFT) of large language models. The approach aims to preserve the pre-trained model's distribution while efficiently activating task-relevant capabilities in low-data regimes, and it reports improved performance on mathematical reasoning benchmarks.
EKSFT addresses a fundamental challenge in post-training large language models: the tension between learning from limited supervised data and maintaining the integrity of pre-trained knowledge. Traditional supervised fine-tuning often causes distribution shift when datasets are small, pushing models to overfit to specific examples rather than acquire generalizable task capabilities. The resulting drift away from the pre-trained distribution then hampers exploration during reinforcement learning (RL), which typically follows SFT in modern training pipelines.
The proposed entropy-KL divergence masking strategy represents an incremental but meaningful improvement in fine-tuning efficiency. By identifying tokens with high predictive entropy or high KL divergence from a reference model and excluding them from the training loss, EKSFT filters out noisy or distribution-shifting training signals. This selective approach preserves the model's foundational capabilities while injecting task-specific knowledge. The method reflects growing recognition that post-training quality depends not just on data quantity but on how training signals are curated and applied.
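The masking step described above could look roughly like the following NumPy sketch. The summary does not say which model's entropy is measured, which direction the KL term takes, or how the cutoffs are chosen, so every such detail here is an assumption: entropy is taken over the fine-tuned model's next-token distribution, KL is computed from the fine-tuned model to a frozen reference model, and the thresholds are per-sequence quantiles (`entropy_q`, `kl_q`, both hypothetical parameters). The function names are illustrative and are not taken from the EKSFT work.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_entropy(logits):
    # Shannon entropy of the next-token distribution at each position.
    logp = log_softmax(logits)
    return -(np.exp(logp) * logp).sum(axis=-1)

def token_kl(policy_logits, ref_logits):
    # KL(policy || reference) at each position (direction is an assumption).
    logp = log_softmax(policy_logits)
    logq = log_softmax(ref_logits)
    return (np.exp(logp) * (logp - logq)).sum(axis=-1)

def keep_mask(policy_logits, ref_logits, entropy_q=0.9, kl_q=0.9):
    # Keep a token only if neither its entropy nor its KL exceeds the
    # chosen quantile (quantile thresholds are an assumption).
    h = token_entropy(policy_logits)
    kl = token_kl(policy_logits, ref_logits)
    return (h <= np.quantile(h, entropy_q)) & (kl <= np.quantile(kl, kl_q))

def masked_nll(policy_logits, targets, mask):
    # Average negative log-likelihood over the kept tokens only.
    logp = log_softmax(policy_logits)
    nll = -logp[np.arange(len(targets)), targets]
    return (nll * mask).sum() / max(mask.sum(), 1)

# Toy sequence of 4 tokens over a 3-word vocabulary.
policy = np.array([[10., 0., 0.], [0., 0., 0.], [10., 0., 0.], [10., 0., 0.]])
ref = np.array([[10., 0., 0.], [0., 0., 0.], [0., 10., 0.], [10., 0., 0.]])
targets = np.array([0, 0, 0, 0])

mask = keep_mask(policy, ref)  # token 1 (high entropy) and token 2 (high KL) are dropped
loss = masked_nll(policy, targets, mask)
```

In a real training loop the same mask would multiply the per-token cross-entropy before backpropagation, so masked tokens contribute no gradient.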
For AI development teams, EKSFT offers practical benefits in resource-constrained scenarios common in academic and early-stage commercial settings. The consistent improvements across mathematical reasoning benchmarks suggest the gains are robust within that domain, although generalization to other task types has yet to be shown. More significantly, improved RL performance following EKSFT indicates downstream benefits for alignment and capability tuning stages.
The research contributes to the broader trend of making large model training more sample-efficient and interpretable. As model sizes continue to grow, techniques that extract more from limited supervised data become increasingly valuable. Future research might explore how entropy-KL masking scales to larger models and to task domains beyond mathematical reasoning.
- EKSFT selectively masks high-entropy and high-KL-divergence tokens to prevent distribution shift during supervised fine-tuning.
- The method preserves pre-trained model distributions while activating task-relevant capabilities in low-data regimes.
- Empirical results show EKSFT consistently outperforms standard SFT on mathematical reasoning benchmarks.
- RL exploration improves when training starts from an EKSFT-based initialization, indicating downstream benefits for reinforcement learning stages.
- The approach targets training efficiency and sample efficiency in post-training large language models.