Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Researchers demonstrate that finetuning a large language model with the same optimizer used during pretraining reduces catastrophic forgetting while maintaining or improving task performance. This "optimizer-model consistency" effect suggests that optimizers act as implicit regularizers whose update patterns preserve learned knowledge, with implications for efficient model adaptation strategies.
The research reveals a fundamental principle in neural network training: optimizer choice creates persistent structural patterns that influence how models learn and retain information. When a large language model undergoes supervised finetuning with a different optimizer from the one used during pretraining, it suffers greater catastrophic forgetting, meaning degradation of previously learned capabilities. Using the same optimizer keeps updates aligned with the optimization landscape the model was trained in, enabling more efficient knowledge transfer.
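In a PyTorch-style workflow, the principle amounts to building the finetuning optimizer from the same family as pretraining and, where the checkpoint provides it, restoring the saved optimizer state. The sketch below is an illustration of the idea under assumed checkpoint contents and hyperparameters (the paths, betas, and learning rates are hypothetical), not the paper's implementation.

```python
# Minimal sketch: reuse the pretraining optimizer family (here AdamW) and,
# if the checkpoint saved it, its state, instead of switching optimizers.
import torch
from torch.optim import AdamW

def build_finetune_optimizer(model, pretrain_ckpt_path, lr=1e-5):
    # Same optimizer class and (assumed) betas/weight decay as pretraining;
    # typically only the learning rate is lowered for finetuning.
    optimizer = AdamW(model.parameters(), lr=lr, betas=(0.9, 0.95), weight_decay=0.1)

    ckpt = torch.load(pretrain_ckpt_path, map_location="cpu")
    if "optimizer" in ckpt:
        # Restoring the moment estimates keeps updates aligned with the
        # optimization landscape the model was pretrained in.
        optimizer.load_state_dict(ckpt["optimizer"])
        for group in optimizer.param_groups:
            group["lr"] = lr  # override only the learning rate
    return optimizer
```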
This finding challenges the conventional wisdom that treats optimizers as interchangeable tools. The study shows that optimizers function as implicit regularizers, shaping activation patterns and weight-update trajectories. The research also compares Muon against AdamW, finding that Muon exhibits stronger memorization tendencies that hurt reasoning performance when finetuning datasets are small. This suggests that different optimizers encode different inductive biases about how models should learn.
For practitioners developing production language models, optimizer-model consistency offers a practical guideline: full finetuning with the pretraining optimizer yields better outcomes than switching optimizers or falling back to parameter-efficient methods such as LoRA. This could reduce computational costs and improve model quality simultaneously. The finding also has theoretical significance, indicating that model behavior emerges from complex interactions between architecture, optimizer dynamics, and training data rather than from any component in isolation.
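One way a practitioner might check the effect is to measure loss on a held-out pretraining-domain corpus before and after finetuning and compare the increase across optimizer choices. The sketch below assumes a HuggingFace-style causal language model that returns a loss when given labels and a hypothetical `eval_loader`; it is not taken from the paper.

```python
# Hedged sketch: quantify forgetting as the increase in held-out
# pretraining-domain loss after finetuning.
import torch

@torch.no_grad()
def pretrain_domain_loss(model, eval_loader, device="cuda"):
    model.eval()
    total, count = 0.0, 0
    for batch in eval_loader:
        input_ids = batch["input_ids"].to(device)
        # Assumes a causal LM that computes next-token loss when labels are given.
        out = model(input_ids=input_ids, labels=input_ids)
        total += out.loss.item() * input_ids.size(0)
        count += input_ids.size(0)
    return total / count

# Usage: measure before finetuning, finetune with the matched optimizer,
# then measure again; a smaller increase indicates less forgetting.
# loss_before = pretrain_domain_loss(model, eval_loader)
# ... finetune ...
# forgetting = pretrain_domain_loss(model, eval_loader) - loss_before
```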
Future research should explore whether this principle extends beyond language models to vision and multimodal systems, and whether optimizer consistency matters for continual learning scenarios involving multiple sequential finetuning stages.
- Using the same optimizer for pretraining and finetuning reduces catastrophic forgetting while maintaining task performance
- Optimizers function as implicit regularizers that shape model learning landscapes and activation patterns
- The Muon optimizer exhibits memorization tendencies that degrade reasoning performance on small finetuning datasets
- Optimizer-model consistency outperforms switching to LoRA or alternative optimizers
- This principle could reduce computational costs while improving language model quality in production systems