Researchers demonstrate that neural network solutions trained with a single optimizer, such as AdamW or Muon, form connected sets at large network widths, revealing optimizer-dependent structure in loss landscapes. In small networks, different optimizers can converge to disconnected solutions separated by provable loss barriers; empirically, in GPT-2 pretraining, interpolation paths between same-optimizer solutions preserve model spectra while cross-optimizer paths exhibit smooth spectral transitions.
This research addresses a fundamental gap in deep learning theory by examining how optimizer choice shapes the geometric structure of solution spaces. While mode connectivity—the study of whether different neural network solutions can be connected through low-loss paths—has received substantial attention, the role of specific optimizers in determining this connectivity remains understudied. The paper reveals that optimizer-induced implicit regularization creates distinct solution manifolds that may or may not intersect depending on network width and regularization parameters.
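As a point of reference, the usual formalization of this connectivity measures the loss barrier along a path between two solutions; for the straight-line path it reads (the paper may use a variant):

$$
B(\theta_0, \theta_1) = \max_{t \in [0,1]} \Big[ L\big((1-t)\,\theta_0 + t\,\theta_1\big) - \big((1-t)\,L(\theta_0) + t\,L(\theta_1)\big) \Big],
$$

with $B \approx 0$ indicating linear mode connectivity; general mode connectivity allows any continuous path $\theta(t)$ from $\theta_0$ to $\theta_1$ in place of the segment.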
The findings integrate classical optimization theory with modern empirical practice in large language models. Two-layer ReLU networks serve as the theoretical testbed: the authors prove that solutions found by a single optimizer (AdamW, Muon, or Lion-family variants) become connected at sufficient width, a non-trivial result suggesting that optimizer identity fundamentally constrains the reachable solution space. The characterization of how the solution regions of different optimizers interact has immediate relevance to transfer learning and model merging, where practitioners combine models trained with different optimizers.
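A minimal empirical sketch of the width effect in that testbed, under stated assumptions (synthetic data, PyTorch, illustrative hyperparameters; the paper's result is theoretical and may account for symmetries such as neuron permutations that this naive linear probe ignores):

```python
# Train pairs of two-layer ReLU networks with AdamW from different seeds
# and measure the linear-path loss barrier at each width. The task and
# hyperparameters are illustrative stand-ins, not the paper's setup.
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 16)
y = torch.sin(X.sum(dim=1, keepdim=True))  # synthetic regression target
loss_fn = nn.MSELoss()

def train_two_layer(width: int, seed: int, steps: int = 800) -> nn.Module:
    """Two-layer ReLU network trained with AdamW (weight decay on)."""
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(16, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.AdamW(model.parameters(), lr=1e-2, weight_decay=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return model

@torch.no_grad()
def linear_path_barrier(m0: nn.Module, m1: nn.Module, steps: int = 25) -> float:
    """Max excess loss on the segment between the two parameter vectors."""
    sd0, sd1 = m0.state_dict(), m1.state_dict()
    probe = copy.deepcopy(m0)
    ts = torch.linspace(0.0, 1.0, steps).tolist()
    losses = []
    for t in ts:
        probe.load_state_dict({k: (1 - t) * sd0[k] + t * sd1[k] for k in sd0})
        losses.append(loss_fn(probe(X), y).item())
    # Barrier: max loss on the path minus the linear interpolation of
    # the endpoint losses (a common mode-connectivity statistic).
    return max(
        l - ((1 - t) * losses[0] + t * losses[-1]) for l, t in zip(losses, ts)
    )

for width in (4, 32, 256):
    m0 = train_two_layer(width, seed=1)
    m1 = train_two_layer(width, seed=2)
    print(f"width {width:4d}: barrier = {linear_path_barrier(m0, m1):.4f}")
```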
For developers building production systems, these results suggest that optimizer choice influences not just convergence speed but the actual geometry of learned representations. The empirical observation that same-optimizer paths preserve spectral properties while cross-optimizer paths show smooth spectral transitions has practical implications for fine-tuning strategies and ensemble methods; understanding these optimizer-dependent structures could inform better initialization schemes and transfer learning protocols.
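One way "preserving spectral properties" might be probed operationally is sketched below, assuming the measurement compares singular value spectra of corresponding weight matrices at points along the path; random matrices stand in here for real GPT-2 checkpoint weights:

```python
# Hypothetical spectral probe along an interpolation path: compare the
# singular value spectrum of an interpolated weight matrix against the
# t=0 endpoint. The paper's GPT-2 experiment presumably does something
# similar per layer on trained checkpoints.
import torch

torch.manual_seed(0)
W0 = torch.randn(256, 256) / 16.0  # stand-in: layer weight from run A
W1 = torch.randn(256, 256) / 16.0  # stand-in: layer weight from run B

s0 = torch.linalg.svdvals(W0)  # endpoint spectrum for reference
for t in torch.linspace(0.0, 1.0, 5).tolist():
    Wt = (1 - t) * W0 + t * W1
    st = torch.linalg.svdvals(Wt)
    # Relative spectral drift versus the t=0 endpoint; a near-flat curve
    # in t would correspond to a spectrum-preserving path.
    drift = (torch.norm(st - s0) / torch.norm(s0)).item()
    print(f"t={t:.2f}  top singular value={st[0].item():.3f}  drift={drift:.3f}")
```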
- Solutions from individual optimizers form connected sets at large network widths, embedding optimizer identity into loss landscape geometry
- Different optimizers can converge to disconnected zero-loss solutions separated by provable loss barriers in small networks
- Same-optimizer paths in GPT-2 preserve model spectrum characteristics while cross-optimizer paths show smooth spectral transitions
- Optimizer-induced implicit regularization creates distinct solution manifolds that may overlap or remain disjoint based on width and regularization
- Results suggest optimizer choice affects representation geometry beyond convergence speed, relevant for transfer learning and model merging (a minimal merging sketch follows below)
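Given the relevance to model merging noted in the last point, a hypothetical pre-merge sanity check follows: average weights only after confirming the midpoint loss is close to the endpoint losses, since these results suggest that is likelier for same-optimizer checkpoints. All names below are illustrative:

```python
# Hypothetical pre-merge sanity check: naive weight averaging is only
# well-behaved when the midpoint of the two checkpoints stays in a
# low-loss region, i.e. when no loss barrier separates them.
import torch
import torch.nn as nn

def merge_state_dicts(sd_a: dict, sd_b: dict, alpha: float = 0.5) -> dict:
    """Convex combination of two architecture-compatible state dicts."""
    return {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k] for k in sd_a}

# Toy demonstration with two randomly initialized compatible models.
a = nn.Linear(8, 2)
b = nn.Linear(8, 2)
merged = nn.Linear(8, 2)
merged.load_state_dict(merge_state_dicts(a.state_dict(), b.state_dict()))
# In practice: evaluate `merged` on held-out data and compare against
# both endpoints before trusting the merge (a large gap signals a barrier).
```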