A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design
Researchers propose a new framework for supervised fine-tuning (SFT) of language models that reinterprets the training process as target distribution design rather than simple token likelihood maximization. The Q-target framework allows models to allocate probability mass flexibly across token alternatives, unifying existing SFT variants and demonstrating consistent performance improvements across reasoning tasks.
This research addresses a fundamental limitation in how large language models are trained on demonstrated trajectories. Traditional supervised fine-tuning forces models to assign maximum probability to observed tokens regardless of context, ignoring that demonstrated sequences may contain errors, ambiguities, or suboptimal choices. The proposed Q-target framework reframes this problem by treating SFT as an explicit design choice: deciding how much to trust the observed token versus distributing probability across plausible alternatives that align with the model's learned priors.
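To make the trust-versus-redistribute choice concrete, here is a minimal sketch of one possible soft target. It is an illustration, not the paper's formulation: the function name `q_target`, the trust weight `alpha`, and the rule of spreading the leftover mass in proportion to the model's own probabilities are all assumptions introduced here.

```python
import math


def q_target(model_probs, observed, alpha):
    """Build a soft target for one token position (illustrative only).

    alpha is a hypothetical trust weight in [0, 1]: the probability mass
    placed on the observed token. The remaining 1 - alpha is spread over
    the other tokens in proportion to the model's current probabilities.
    alpha = 1 recovers the usual one-hot SFT target.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    rest = sum(p for i, p in enumerate(model_probs) if i != observed)
    if rest <= 0.0:
        # No mass on alternatives to redistribute: fall back to one-hot.
        return [1.0 if i == observed else 0.0 for i in range(len(model_probs))]
    return [
        alpha if i == observed else (1.0 - alpha) * p / rest
        for i, p in enumerate(model_probs)
    ]


def cross_entropy(target, model_probs, eps=1e-12):
    """Cross-entropy of the model's distribution against a (soft) target."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, model_probs))


# Toy three-token vocabulary; the demonstration contains token 1.
probs = [0.5, 0.3, 0.2]
one_hot = q_target(probs, observed=1, alpha=1.0)
soft = q_target(probs, observed=1, alpha=0.7)
print("one-hot target:", one_hot, "loss:", cross_entropy(one_hot, probs))
print("soft target:   ", soft, "loss:", cross_entropy(soft, probs))
```

Under these assumptions, lowering `alpha` expresses less trust in the demonstrated token, while the proportional split keeps the alternatives ranked as the model already ranks them.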
The work emerges from growing recognition that language models trained on internet-scale data contain substantial knowledge that may diverge from demonstration labels. When a model's learned distribution conflicts with training targets, strict one-hot fitting can suppress useful knowledge and degrade generalization. By allowing soft targets rather than hard assignments, the framework preserves the model's prior understanding while incorporating demonstration signals.
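One way to see why one-hot fitting can suppress alternatives: for cross-entropy on a softmax output, the gradient with respect to each logit is the model probability minus the target probability. The sketch below compares that gradient under a one-hot target and under a soft target. The soft target used here (a hypothetical trust weight of 0.7 on the observed token, with the remainder split in proportion to the model's probabilities) is an assumption for illustration, not the paper's definition.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_gradient(probs, target):
    """Gradient of softmax cross-entropy w.r.t. the logits: p_i - q_i."""
    return [p - q for p, q in zip(probs, target)]


# Toy three-token vocabulary; the demonstration contains token 1.
logits = [math.log(0.5), math.log(0.3), math.log(0.2)]
probs = softmax(logits)
observed = 1

# Standard SFT: all target mass on the observed token.
hard = [1.0 if i == observed else 0.0 for i in range(len(probs))]

# Illustrative soft target (assumed form, see lead-in).
trust = 0.7
rest = 1.0 - probs[observed]
soft = [
    trust if i == observed else (1.0 - trust) * p / rest
    for i, p in enumerate(probs)
]

grad_hard = logit_gradient(probs, hard)
grad_soft = logit_gradient(probs, soft)

# A positive entry means gradient descent lowers that token's logit.
print("one-hot gradient:", grad_hard)
print("soft gradient:   ", grad_soft)
```

In this toy setting the soft target pushes the alternative tokens down less strongly than the one-hot target does, which is the sense in which more of the model's prior is left intact.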
The practical impact extends across AI model development pipelines. The authors report consistent gains across ten reasoning dataset-model settings, which suggests the approach applies beyond a single domain. This matters for organizations fine-tuning models for specialized tasks, where training data quality varies and model priors encode valuable information. The framework also unifies seemingly disparate SFT variants under a single principle, giving researchers a more systematic vocabulary for exploring training objectives.
Looking forward, this work opens investigation into how different target distribution designs affect model behavior in reasoning, instruction-following, and alignment tasks. The framework may particularly influence development of reasoning-focused models where intermediate steps often admit multiple valid solutions.
- SFT can be reframed as target distribution design rather than one-hot token likelihood maximization, creating more flexible training objectives.
- The Q-target framework explicitly decomposes supervision into two components: reliance on observed tokens and probability allocation across alternatives.
- This approach unifies existing SFT variants as implicit choices within a broader design space.
- Target-SFT demonstrates consistent performance improvements across ten reasoning dataset-model settings in evaluation.
- The framework preserves valuable model priors during fine-tuning rather than suppressing them through strict token-level fitting.