Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Researchers demonstrate that finetuning a large language model with the same optimizer used during pretraining reduces catastrophic forgetting while maintaining or improving task performance. This "optimizer-model consistency" effect suggests that optimizers act as implicit regularizers whose update patterns preserve learned knowledge, with implications for efficient model adaptation strategies.
The research reveals a fundamental principle in neural network training: optimizer choice creates persistent structural patterns that influence how models learn and retain information. When a large language model undergoes supervised finetuning with a different optimizer from the one used during pretraining, it suffers greater catastrophic forgetting, meaning degradation of previously learned capabilities. Using the same optimizer keeps updates aligned with the optimization landscape the model was trained in, enabling more efficient knowledge transfer.
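In a PyTorch-style workflow, the principle amounts to building the finetuning optimizer from the same family as pretraining and, where the checkpoint provides it, restoring the saved optimizer state. The sketch below is an illustration of the idea under assumed checkpoint contents and hyperparameters (the paths, betas, and learning rates are hypothetical), not the paper's implementation.

```python
# Minimal sketch: reuse the pretraining optimizer family (here AdamW) and,
# if the checkpoint saved it, its state, instead of switching optimizers.
import torch
from torch.optim import AdamW

def build_finetune_optimizer(model, pretrain_ckpt_path, lr=1e-5):
    # Same optimizer class and (assumed) betas/weight decay as pretraining;
    # typically only the learning rate is lowered for finetuning.
    optimizer = AdamW(model.parameters(), lr=lr, betas=(0.9, 0.95), weight_decay=0.1)

    ckpt = torch.load(pretrain_ckpt_path, map_location="cpu")
    if "optimizer" in ckpt:
        # Restoring the moment estimates keeps updates aligned with the
        # optimization landscape the model was pretrained in.
        optimizer.load_state_dict(ckpt["optimizer"])
        for group in optimizer.param_groups:
            group["lr"] = lr  # override only the learning rate
    return optimizer
```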
This finding challenges the conventional wisdom that treats optimizers as interchangeable tools. The study shows that optimizers function as implicit regularizers, shaping activation patterns and weight-update trajectories. The research also compares Muon against AdamW, finding that Muon exhibits stronger memorization tendencies that hurt reasoning performance when finetuning datasets are small. This suggests that different optimizers encode different inductive biases about how models should learn.
For practitioners developing production language models, optimizer-model consistency offers a practical guideline: full finetuning with the pretraining optimizer yields better outcomes than switching optimizers or falling back to parameter-efficient methods such as LoRA. This could reduce computational costs and improve model quality simultaneously. The finding also has theoretical significance, indicating that model behavior emerges from complex interactions between architecture, optimizer dynamics, and training data rather than from any component in isolation.
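One way a practitioner might check the effect is to measure loss on a held-out pretraining-domain corpus before and after finetuning and compare the increase across optimizer choices. The sketch below assumes a HuggingFace-style causal language model that returns a loss when given labels and a hypothetical `eval_loader`; it is not taken from the paper.

```python
# Hedged sketch: quantify forgetting as the increase in held-out
# pretraining-domain loss after finetuning.
import torch

@torch.no_grad()
def pretrain_domain_loss(model, eval_loader, device="cuda"):
    model.eval()
    total, count = 0.0, 0
    for batch in eval_loader:
        input_ids = batch["input_ids"].to(device)
        # Assumes a causal LM that computes next-token loss when labels are given.
        out = model(input_ids=input_ids, labels=input_ids)
        total += out.loss.item() * input_ids.size(0)
        count += input_ids.size(0)
    return total / count

# Usage: measure before finetuning, finetune with the matched optimizer,
# then measure again; a smaller increase indicates less forgetting.
# loss_before = pretrain_domain_loss(model, eval_loader)
# ... finetune ...
# forgetting = pretrain_domain_loss(model, eval_loader) - loss_before
```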
Future research should explore whether this principle extends beyond language models to vision and multimodal systems, and whether optimizer consistency matters for continual learning scenarios involving multiple sequential finetuning stages.
- Using the same optimizer for pretraining and finetuning reduces catastrophic forgetting while maintaining task performance
- Optimizers function as implicit regularizers that shape model learning landscapes and activation patterns
- The Muon optimizer exhibits memorization tendencies that degrade reasoning performance on small finetuning datasets
- Optimizer-model consistency outperforms switching to LoRA or alternative optimizers
- This principle could reduce computational costs while improving language model quality in production systems