#adamw News & Analysis

4 articles tagged with #adamw. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 8


FlashOptim: Optimizers for Memory Efficient Training

FlashOptim is a suite of optimizers that cuts per-parameter training memory by more than 50% while maintaining model quality. For AdamW, it reduces memory usage from 16 bytes to 7 bytes per parameter by combining improved master-weight splitting with 8-bit quantization of the optimizer state.
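The per-parameter figures quoted above scale to a whole model by simple multiplication. The sketch below works through that arithmetic. Only the 16-byte and 7-byte figures come from the summary; the 7-billion-parameter model size and the helper names are hypothetical, chosen purely for illustration.

```python
# Back-of-the-envelope arithmetic for the per-parameter figures quoted in the summary.
# Assumption: the model size below is hypothetical and not taken from the article.

BASELINE_BYTES_PER_PARAM = 16    # AdamW baseline, as quoted in the summary
FLASHOPTIM_BYTES_PER_PARAM = 7   # FlashOptim figure, as quoted in the summary
HYPOTHETICAL_NUM_PARAMS = 7_000_000_000  # illustrative model size only


def relative_reduction(before: float, after: float) -> float:
    """Fraction of memory saved when going from `before` to `after`."""
    return 1.0 - after / before


def total_gib(num_params: int, bytes_per_param: int) -> float:
    """Total memory in GiB for a given per-parameter byte cost."""
    return num_params * bytes_per_param / 2**30


reduction = relative_reduction(BASELINE_BYTES_PER_PARAM, FLASHOPTIM_BYTES_PER_PARAM)
baseline_gib = total_gib(HYPOTHETICAL_NUM_PARAMS, BASELINE_BYTES_PER_PARAM)
flashoptim_gib = total_gib(HYPOTHETICAL_NUM_PARAMS, FLASHOPTIM_BYTES_PER_PARAM)

print(f"Per-parameter reduction: {reduction:.2%}")
print(f"Baseline:   {baseline_gib:.1f} GiB")
print(f"FlashOptim: {flashoptim_gib:.1f} GiB")
```

Going from 16 to 7 bytes is a 56.25% reduction, which is consistent with the "more than 50%" claim. Activations and other training buffers are not part of this count.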

AI · Neutral · arXiv – CS AI · May 12 · 6/10


Optimizer-Induced Mode Connectivity: From AdamW to Muon

Researchers show that, at large network widths, the solutions reached by a given optimizer such as AdamW or Muon form a connected set, revealing optimizer-dependent structure in the loss landscape. In small networks, solutions found by different optimizers are provably disconnected, separated by loss barriers. In GPT-2 pretraining experiments, paths between same-optimizer solutions preserve model spectra differently than paths between cross-optimizer solutions.
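One common way to quantify a loss barrier between two solutions is to evaluate the loss along the straight line between them and measure how far it rises above the interpolated endpoint losses. The sketch below illustrates that idea on a toy one-dimensional loss. It assumes linear interpolation, which may differ from the path construction used in the study, and all function names are hypothetical.

```python
from typing import Callable, List, Sequence

Params = Sequence[float]


def interpolate(a: Params, b: Params, t: float) -> List[float]:
    """Point at fraction t on the straight line from a to b."""
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def linear_loss_barrier(
    loss: Callable[[Params], float],
    theta_a: Params,
    theta_b: Params,
    num_points: int = 101,
) -> float:
    """Largest excess of the path loss over the interpolated endpoint losses."""
    loss_a, loss_b = loss(theta_a), loss(theta_b)
    barrier = 0.0
    for i in range(num_points):
        t = i / (num_points - 1)
        path_loss = loss(interpolate(theta_a, theta_b, t))
        baseline = (1.0 - t) * loss_a + t * loss_b
        barrier = max(barrier, path_loss - baseline)
    return barrier


def double_well(theta: Params) -> float:
    """Toy loss with two separate minima at x = -1 and x = +1."""
    (x,) = theta
    return (x * x - 1.0) ** 2


def bowl(theta: Params) -> float:
    """Toy convex loss: no barrier along any straight line."""
    (x,) = theta
    return x * x


double_well_barrier = linear_loss_barrier(double_well, [-1.0], [1.0])
bowl_barrier = linear_loss_barrier(bowl, [-1.0], [1.0])
print(double_well_barrier, bowl_barrier)
```

A barrier near zero along some path is evidence that two solutions are connected. A positive barrier along the straight line alone does not rule out a curved low-loss path, which is why linear interpolation is only a first check.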

AI · Neutral · arXiv – CS AI · May 9 · 6/10


Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

Researchers demonstrate that using the same optimizer during both pretraining and finetuning of large language models reduces catastrophic forgetting while maintaining or improving task performance. This "optimizer-model consistency" effect suggests optimizers create regularization patterns that preserve learned knowledge, with implications for efficient model adaptation strategies.

AI · Neutral · arXiv – CS AI · Mar 2 · 4/10 · 5


Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training

Researchers analyzed training trajectories of small transformer models and found that parameter updates organize into a dominant drift direction plus transverse dynamics around it. Different optimizers (AdamW versus SGD) produce substantially different trajectory geometries: AdamW develops multi-dimensional structure, while SGD evolves along a more nearly linear path.
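To make the drift and transverse split concrete, the sketch below decomposes each update in a toy trajectory into a component along a single drift direction and a residual orthogonal to it. Taking the drift direction to be the normalized net displacement from the first to the last checkpoint is an assumption made here for simplicity; the study may define it differently (for example, through a principal-component analysis). All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def sub(a: Vector, b: Vector) -> List[float]:
    return [x - y for x, y in zip(a, b)]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def norm(a: Vector) -> float:
    return math.sqrt(dot(a, a))


def drift_decomposition(
    checkpoints: Sequence[Vector],
) -> Tuple[List[float], List[float], List[float]]:
    """Split each update into a signed drift component and a transverse magnitude."""
    net = sub(checkpoints[-1], checkpoints[0])
    net_norm = norm(net)
    if net_norm == 0.0:
        raise ValueError("trajectory has no net displacement")
    # Assumption: drift direction = normalized net displacement of the trajectory.
    drift_dir = [x / net_norm for x in net]

    along, transverse = [], []
    for prev, curr in zip(checkpoints, checkpoints[1:]):
        step = sub(curr, prev)
        a = dot(step, drift_dir)                       # signed length along the drift
        residual = [s - a * d for s, d in zip(step, drift_dir)]
        along.append(a)
        transverse.append(norm(residual))              # size of the orthogonal part
    return drift_dir, along, transverse


# Toy 2-D trajectory: steady drift along x with a zig-zag in y.
trajectory = [[0.0, 0.0], [1.0, 0.5], [2.0, -0.5], [3.0, 0.5], [4.0, 0.0]]
drift_dir, along, transverse = drift_decomposition(trajectory)
print(drift_dir, along, transverse)
```

A trajectory whose transverse magnitudes stay small relative to the drift components evolves almost linearly, while large or structured transverse components indicate motion in additional dimensions.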