🧠 AI | ⚪ Neutral | Importance 6/10

Spectral Scaling Laws of Muon

arXiv – CS AI|Gagik Magakyan, Pablo Parrilo, Asuman Ozdaglar|June 4, 2026 at 04:00 AM

🤖AI Summary

Researchers present the first systematic study of how singular value spectra behave in Muon optimizer momentum matrices across model scales from 77M to 2.8B parameters. They discover that singular value quantiles stabilize after training burn-in and follow predictable power laws with model size, enabling practitioners to optimize Newton-Schulz iteration configurations and avoid computational waste at scale.

Analysis

This research addresses a critical but understudied aspect of modern language model training infrastructure. Muon has emerged as a preferred optimizer for state-of-the-art open-source models, yet its reliance on approximate orthonormalization through Newton-Schulz iteration has left it unclear how well that approximation holds up as models grow. The study fills this gap by empirically tracking singular values of the momentum matrices across layer depths and model sizes, revealing regularities that make the spectrum's behavior at scale more predictable than previously understood.
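To see why small singular values are the concern, note that a polynomial Newton-Schulz step acts on each singular value of the normalized momentum matrix independently, so the matrix iteration can be studied as a scalar map. The sketch below assumes the odd quintic form with coefficients (3.4445, -4.7750, 2.0315) found in widely circulated public Muon implementations; this summary does not state which coefficients the paper studies, and the function name is illustrative.

```python
# Assumed quintic coefficients from public Muon implementations
# (not confirmed by this summary).
A, B, C = 3.4445, -4.7750, 2.0315


def newton_schulz_scalar(sigma, steps=5):
    """Apply `steps` quintic Newton-Schulz updates to one singular value.

    For X = U diag(s) V^T, the matrix update
        a*X + b*(X X^T) X + c*(X X^T)^2 X
    keeps U and V fixed and maps each s to a*s + b*s**3 + c*s**5.
    """
    for _ in range(steps):
        sigma = A * sigma + B * sigma**3 + C * sigma**5
    return sigma


if __name__ == "__main__":
    # Singular values are assumed to be normalized to at most 1 beforehand.
    for s in (0.5, 1e-2, 1e-4):
        print(f"{s:g} -> {newton_schulz_scalar(s):.4f}")
```

Under these assumptions, a singular value of 0.5 ends up within a few tenths of 1 after five steps, while one starting at 1e-4 grows by at most a factor of about 3.4445 per step and is still below 0.05, far from orthonormalized.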

The power law relationships discovered, particularly the steep M^-0.96 scaling in late layers, have immediate practical implications. The five Newton-Schulz steps typical of current academic configurations suffice for smaller models but risk degraded updates at frontier scales, where singular values shrink below the threshold for effective orthonormalization. This creates a tension between computational cost and update quality that practitioners must actively manage. The layer-aware findings let developers avoid brute-force fixes such as uniformly increasing the iteration count and instead target additional iterations where they matter most.
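As a rough illustration of what these exponents imply, the sketch below computes how a singular value quantile would change if model size grew tenfold. It assumes that M denotes model size and that the fitted power laws keep holding beyond the 2.8B-parameter range studied, an extrapolation this summary does not confirm. The function name is illustrative.

```python
def quantile_shrink_factor(size_ratio, exponent):
    """Factor by which a quantile changes when model size grows by
    `size_ratio`, assuming the quantile is proportional to M**exponent."""
    return size_ratio ** exponent


late = quantile_shrink_factor(10, -0.96)  # late layers
mid = quantile_shrink_factor(10, -0.25)   # mid-depth layers
print(f"late layers: x{late:.3f}, mid-depth layers: x{mid:.3f}")
```

On these assumptions, a tenfold larger model would see late-layer quantiles drop to about 11% of their value, versus about 56% for mid-depth layers.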

For the AI infrastructure community, this work reduces a previously opaque hyperparameter space to interpretable, scalable principles. Organizations training large models can now make principled tradeoffs between numerical precision and computational efficiency rather than relying on trial-and-error tuning. The research shows that the momentum spectra Muon operates on have coherent mathematical structure, which increases confidence in its adoption for frontier models. Going forward, practitioners should monitor whether these power law relationships hold for models exceeding 2.8B parameters and whether different architectures or data regimes alter the observed scaling patterns.

Key Takeaways

→Singular value quantiles in Muon momentum buffers stabilize after burn-in and follow clean power laws in model size
→Late-layer singular values scale aggressively (M^-0.96) and may fall below Newton-Schulz orthonormalization thresholds at frontier scales
→Mid-depth layers scale mildly (M^-0.25), allowing current five-iteration configurations to remain effective at larger scales
→Power law relationships enable layer-aware, computationally efficient selection of Newton-Schulz iteration counts instead of uniform increases
→The research provides principled optimization guidance for practitioners training state-of-the-art models with the Muon optimizer
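The layer-aware selection described in the takeaways could be sketched as follows. Everything numeric here is a placeholder: the reference quantile `SIGMA_REF`, the `floor` standing in for an "effective orthonormalization threshold", and the quintic coefficients (taken from public Muon implementations) are assumptions, not values reported in this summary, and the function names are hypothetical.

```python
# Assumed quintic coefficients from public Muon implementations.
A, B, C = 3.4445, -4.7750, 2.0315


def extrapolate_quantile(sigma_ref, size_ratio, exponent):
    """Extrapolate a singular value quantile with an assumed power law
    in model size."""
    return sigma_ref * size_ratio ** exponent


def steps_to_reach(sigma, floor=0.3, max_steps=20):
    """Smallest number of Newton-Schulz steps after which sigma >= floor.

    `floor` is a hypothetical threshold; returns None if it is not
    reached within `max_steps` steps.
    """
    for k in range(max_steps + 1):
        if sigma >= floor:
            return k
        sigma = A * sigma + B * sigma**3 + C * sigma**5
    return None


SIGMA_REF = 1e-2  # hypothetical low quantile at the reference model size
plan = {
    name: steps_to_reach(extrapolate_quantile(SIGMA_REF, 10, exponent))
    for name, exponent in {"mid": -0.25, "late": -0.96}.items()
}
print(plan)
```

With these placeholder inputs, the late layers need more steps than the mid-depth layers, which is the kind of per-layer budget the takeaways point to. Real thresholds and reference quantiles would have to come from measurements on the model being trained.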