Training for the Model You Return: Improving Optimization for Iterate-Averaged Language Models
Researchers present PACE, an optimizer wrapper that improves language model performance by training for the iterate-averaged weights rather than the final training weights. By casting training as an optimal-control problem and wrapping AdamW with a clipped pull toward the exponential moving average (EMA) of the weights, PACE comes with theoretical convergence guarantees and shows empirical gains in supervised fine-tuning of 1-2B parameter models and in GPT-2 pretraining.
This research addresses a practical disconnect in modern language model training: most production LM pipelines deploy averaged model weights rather than final training iterates, yet optimization algorithms are designed to minimize final-iterate loss. The gap between what we train for and what we actually use has gone largely unexamined until this work.
The technical contribution stems from optimal-control theory applied to stochastic optimization. By modeling the problem in continuous time with a quadratic approximation, researchers derived a control strategy that balances performance improvement against computational overhead. PACE implements this as a lightweight modifier to AdamW, using clipped per-coordinate control strength to pull live weights toward their exponential moving average. This elegant simplicity enables practical adoption without extensive hyperparameter tuning.
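The mechanism described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the update order (AdamW step, then pull, then EMA update), the function names `pull_toward_ema`, `ema_update`, and `run_sketch`, the parameters `strength`, `max_strength`, and `decay`, and the constant pull strength are all assumptions made for illustration. In particular, the rule PACE uses to compute and clip its per-coordinate control strength is not given in this summary, so the sketch simply clips a caller-supplied strength to `[0, max_strength]`.

```python
import math
import random


def adamw_step(w, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8, wd=0.0):
    """One textbook AdamW update on lists of floats; returns new (w, m, v)."""
    out_w, out_m, out_v = [], [], []
    for wi, gi, mi, vi in zip(w, g, m, v):
        mi = b1 * mi + (1 - b1) * gi           # first-moment estimate
        vi = b2 * vi + (1 - b2) * gi * gi      # second-moment estimate
        m_hat = mi / (1 - b1 ** t)             # bias correction (t starts at 1)
        v_hat = vi / (1 - b2 ** t)
        # Decoupled weight decay, as in AdamW.
        wi = wi - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * wi)
        out_w.append(wi)
        out_m.append(mi)
        out_v.append(vi)
    return out_w, out_m, out_v


def pull_toward_ema(w, ema, strength, max_strength=1.0):
    """Move each live weight toward its EMA by a clipped per-coordinate strength.

    ASSUMPTION: clipping the strength to [0, max_strength] stands in for
    PACE's actual clipping rule, which this summary does not specify.
    """
    pulled = []
    for wi, ei, k in zip(w, ema, strength):
        k = min(max(k, 0.0), max_strength)
        pulled.append(wi + k * (ei - wi))
    return pulled


def ema_update(ema, w, decay=0.9):
    """Standard exponential moving average of the weights."""
    return [decay * ei + (1 - decay) * wi for ei, wi in zip(ema, w)]


def run_sketch(steps=500, seed=0, pull=0.01):
    """Toy run on f(w) = 0.5 * w^2 with noisy gradients; returns (w, ema)."""
    rng = random.Random(seed)
    w = [5.0]
    ema = list(w)          # the averaged weights are what would be returned
    m, v = [0.0], [0.0]
    for t in range(1, steps + 1):
        g = [wi + rng.gauss(0.0, 0.1) for wi in w]   # noisy gradient
        w, m, v = adamw_step(w, g, m, v, t)
        w = pull_toward_ema(w, ema, [pull] * len(w))  # hypothetical constant strength
        ema = ema_update(ema, w)
    return w, ema


if __name__ == "__main__":
    live, averaged = run_sketch()
    print("live weights:", live)
    print("averaged weights:", averaged)
```

Setting `pull=0.0` in `run_sketch` reduces the loop to plain AdamW with an EMA tracked on the side, which corresponds to the "EMA-evaluated AdamW" baseline mentioned in the results. The toy problem only demonstrates the mechanics and is not evidence for or against the reported gains.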
The theoretical guarantees are meaningful but bounded: PACE matches standard convergence rates up to a factor that depends on the averaging rule, and on certain quadratic instances it can improve the limiting error by an arbitrarily large factor. Empirically, results span supervised fine-tuning of 1-2B parameter models and GPT-2 pretraining on FineWeb, showing consistent improvements over both vanilla AdamW and EMA-evaluated AdamW across varied learning rates and schedules.
For the AI development community, this work highlights an under-optimized dimension in training pipelines. While the improvements appear incremental rather than transformative, they operate at the foundation of model training, the kind of systematic refinement that compounds across the industry. The method's lightweight nature and compatibility with existing workflows increase the likelihood of adoption. However, the scope remains primarily academic and research-focused rather than immediately affecting production systems at scale.
- The PACE optimizer pulls live weights toward their exponential moving average during training, optimizing directly for the averaged weights that are actually deployed.
- Theoretical analysis proves convergence at standard rates, and in certain quadratic settings PACE can strictly improve the limiting error by an arbitrarily large factor.
- Empirical validation shows consistent improvements over AdamW in supervised fine-tuning of 1-2B parameter models and in GPT-2 pretraining on FineWeb.
- The method wraps AdamW as a lightweight modifier, requiring minimal effort to integrate into existing training pipelines.
- Results are robust across varied hyperparameters, including learning rates and decay schedules, suggesting broad applicability.