🧠 AI | ⚪ Neutral | Importance 6/10

Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?

arXiv – CS AI|Dingzhi Yu, Hongyi Tao, Yuanyu Wan, Luo Luo, Lijun Zhang|June 23, 2026 at 04:00 AM

🤖 AI Summary

Researchers identify a critical theoretical gap in AdamW, the dominant optimizer for training large language models: it is unknown whether AdamW provably converges under the heavy-tailed gradient noise common in LLM pretraining. The paper formulates this as an open problem and provides partial theoretical insights, while noting that sign-based optimizers such as Lion and Muon, as well as AdaGrad, already have convergence guarantees under heavy-tailed conditions.

Analysis

AdamW's dominance in large language model training rests on strong empirical performance, yet its theoretical foundations remain shaky under real-world LLM pretraining conditions. The core tension is heavy-tailed noise in stochastic gradients: occasional extremely large gradient values that violate the finite-variance assumption underlying most optimizer theory. This gap between theory and practice has become harder to ignore as empirical evidence from actual LLM training runs consistently shows heavy-tailed noise distributions.
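To make "heavy-tailed" concrete, here is a minimal sketch using only Python's standard library. It compares Gaussian noise with symmetric Pareto-type noise whose tail index of 1.5 gives a finite mean but infinite variance. The function names, the tail index, and the sample size are illustrative assumptions, not taken from the paper. Heavy-tailed analyses commonly assume only a bounded p-th moment for some p in (1, 2]; whether the paper uses exactly that form is also an assumption here.

```python
import random

def sample_noise(n, tail_index=None, seed=0):
    """Draw n zero-mean noise samples (illustrative helper, not from the paper).

    tail_index=None     -> standard Gaussian noise (finite variance).
    tail_index in (1,2] -> symmetric Pareto-type noise: finite mean, infinite variance.
    """
    rng = random.Random(seed)
    if tail_index is None:
        return [rng.gauss(0.0, 1.0) for _ in range(n)]
    # paretovariate(a) has tail P(X > x) = x**(-a) for x >= 1;
    # multiplying by a random sign makes the distribution symmetric around zero.
    return [rng.choice((-1.0, 1.0)) * rng.paretovariate(tail_index) for _ in range(n)]

def peak_to_median(samples):
    """Ratio of the largest magnitude to the median magnitude."""
    mags = sorted(abs(s) for s in samples)
    return mags[-1] / mags[len(mags) // 2]

gaussian = sample_noise(100_000)
heavy = sample_noise(100_000, tail_index=1.5)

print("Gaussian peak/median:    ", round(peak_to_median(gaussian), 1))
print("heavy-tailed peak/median:", round(peak_to_median(heavy), 1))
```

In the Gaussian case the largest sample is only a few times the typical one. In the heavy-tailed case a handful of samples dominate, which is the behavior that finite-variance analyses rule out.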

The research landscape has shifted with recent results showing that sign-based optimizers (Lion, Muon) and AdaGrad provably converge under heavy-tailed assumptions. These results have intensified scrutiny of AdamW, particularly its second-moment accumulator, the running average of squared gradients in the update's denominator, which may either help or hinder convergence in heavy-tailed regimes. The authors' lower-bound mechanism shows how memory in this denominator can mask large gradients, potentially creating pathological convergence scenarios.
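The role of the second-moment accumulator can be illustrated with a scalar sketch of the standard AdamW update. This is not the authors' lower-bound construction. The gradient sequence (constant gradients with one spike of 1000) and the hyperparameters are assumptions chosen for illustration. The sketch shows one way memory in the denominator can distort updates: a single large gradient inflates the accumulator, and later ordinary gradients produce much smaller steps until the spike decays.

```python
import math

def adamw_step_sizes(grads, lr=1e-3, beta1=0.9, beta2=0.999,
                     eps=1e-8, weight_decay=0.0):
    """Run scalar AdamW on a gradient sequence; return |parameter change| per step."""
    theta, m, v = 0.0, 0.0, 0.0
    sizes = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first-moment (momentum) estimate
        v = beta2 * v + (1 - beta2) * g * g    # second-moment accumulator
        m_hat = m / (1 - beta1 ** t)           # bias corrections
        v_hat = v / (1 - beta2 ** t)
        # Adaptive step plus decoupled weight decay.
        new_theta = theta - lr * (m_hat / (math.sqrt(v_hat) + eps)
                                  + weight_decay * theta)
        sizes.append(abs(new_theta - theta))
        theta = new_theta
    return sizes

# Hypothetical gradient stream: all ones, with a single spike at index 100.
grads = [1.0] * 300
grads[100] = 1000.0
sizes = adamw_step_sizes(grads)

print("step just before the spike:", sizes[99])
print("step at the spike:         ", sizes[100])
print("step 199 steps later:      ", sizes[299])
```

With `beta2=0.999` the accumulator forgets on a timescale of roughly 1000 steps, while the momentum term with `beta1=0.9` forgets in a few dozen. After the spike, the numerator returns to normal long before the denominator does, so step sizes stay suppressed well past the end of this 300-step run.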

For the AI development community, this theoretical question bears directly on optimizer design choices and training stability guarantees. If AdamW cannot handle heavy-tailed noise, practitioners might face unexpected convergence failures at scale, or the success of current LLM training regimes may rest on luck rather than principled design. Resolving the open problem could reshape optimizer development, potentially accelerating research into theoretically grounded alternatives. The paper's positive weighted-metric benchmark suggests AdamW may survive under modified assumptions, but a proof remains elusive.

Key Takeaways

→ AdamW lacks a rigorous convergence theory under heavy-tailed gradient noise despite being the standard LLM optimizer
→ Sign-based optimizers and AdaGrad already have proven convergence under heavy-tailed conditions, leaving AdamW as the theoretical gap
→ The second-moment accumulator in AdamW may either enable or obstruct convergence in heavy-tailed regimes; this remains unresolved
→ Memory in AdamW's denominator can hide large gradients in ways that potentially prevent convergence
→ Resolving this open problem could fundamentally reshape optimizer design for future large-scale AI training