#muon-optimizer News & Analysis

8 articles tagged with #muon-optimizer. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

8 articles

AIBullisharXiv – CS AI · May 97/10

🧠

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

Researchers provide theoretical proof that sign-based optimization algorithms like SignSGD outperform standard SGD under specific conditions involving ℓ1-norm stationarity and sparse noise, with complexity improvements scaling by problem dimension d. The analysis bridges theory and practice by demonstrating these advantages during GPT-2 pretraining, explaining why sign-based methods succeed in large language model training despite lacking previous theoretical justification.

AIBullisharXiv – CS AI · Mar 127/10

🧠

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

Researchers have developed HTMuon, an improved optimization algorithm for training large language models that builds upon the existing Muon optimizer. HTMuon addresses limitations in Muon's weight spectra by incorporating heavy-tailed spectral corrections, showing up to 0.98 perplexity reduction in LLaMA pretraining experiments.

🏢 Perplexity

AIBullisharXiv – CS AI · Jun 96/10

🧠

Muon Learns More Robust and Transferable Features than Adam

Research demonstrates that Muon, an emerging optimizer for large language models and vision classifiers, produces more robust and transferable features than Adam and SGD across multiple architectures. The study shows Muon-learned features maintain superior performance on corrupted data and transfer more effectively to downstream tasks, with theoretical support provided through margin and effective rank analysis.

AINeutralarXiv – CS AI · Jun 46/10

🧠

Spectral Scaling Laws of Muon

Researchers present the first systematic study of how singular value spectra behave in Muon optimizer momentum matrices across model scales from 77M to 2.8B parameters. They discover that singular value quantiles stabilize after training burn-in and follow predictable power laws with model size, enabling practitioners to optimize Newton-Schulz iteration configurations and avoid computational waste at scale.

AIBullisharXiv – CS AI · Jun 26/10

🧠

DynMuon: A Dynamic Spectral Shaping View of Muon

Researchers propose DynMuon, an enhancement to the Muon optimizer used in large language model training that dynamically adjusts spectral shaping parameters throughout training. The method achieves lower validation loss and requires 10.6-26.5% fewer training steps than standard Muon by shifting from positive to mildly negative spectral exponents.

$UV

AINeutralarXiv – CS AI · May 286/10

🧠

How the Optimizer Shapes Learned Solutions in Equivariant Neural Networks

Researchers demonstrate that the Muon optimizer significantly outperforms Adam when training equivariant neural networks, which encode geometric symmetries by design. Analysis of trained models reveals Muon produces solutions with more regular loss surfaces, higher weight ranks, and better-conditioned representations, suggesting optimizer choice substantially influences how neural networks learn geometric constraints.

AINeutralarXiv – CS AI · May 126/10

🧠

Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

Researchers introduce intrinsic Muon (iMuon), a unified optimization framework that extends the Muon optimizer to Riemannian manifolds while preserving symmetries and enabling closed-form solutions. The approach demonstrates applications in LLM fine-tuning, image classification, and subspace learning with convergence guarantees dependent only on manifold dimension rather than factor conditioning.

AIBullisharXiv – CS AI · Mar 37/107

🧠

MuonRec: Shifting the Optimizer Paradigm Beyond Adam in Scalable Generative Recommendation

Researchers introduce MuonRec, a new optimization framework for recommendation systems that significantly outperforms the widely-used Adam/AdamW optimizers. The framework reduces training steps by 32.4% on average while improving ranking quality by 12.6% in NDCG@10 metrics across traditional and generative recommenders.