🧠 AI | ⚪ Neutral | Importance 4/10

Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD?

arXiv – CS AI | Jihwan Kim, Dogyoon Song, Chulhee Yun | March 3, 2026 at 05:00 AM

🤖 AI Summary

Researchers analyzed scaling laws for signSGD in linear regression, comparing it with standard SGD under a power-law random features model. The study identifies effects specific to signSGD that can yield a steeper compute-optimal scaling law than SGD in noise-dominant regimes.
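To make the comparison concrete, here is a minimal sketch of the two update rules on a toy linear regression whose feature variances and target coefficients both decay as power laws. It is a simplified illustration, not the paper's random features model: the exponent names `alpha` and `beta`, the batch size, the noise level, and the learning rates are all assumptions chosen for the example.

```python
import numpy as np

def make_power_law_problem(d, alpha, beta):
    """Toy problem: feature variance j^(-alpha), target coefficient j^(-beta).

    alpha and beta are hypothetical names for the feature and target decay
    exponents; the paper's exact parameterization may differ.
    """
    j = np.arange(1, d + 1, dtype=float)
    feature_scales = j ** (-alpha / 2.0)  # standard deviation per coordinate
    w_star = j ** (-beta)
    return feature_scales, w_star

def sample_batch(rng, feature_scales, w_star, batch_size, noise_std):
    # Independent Gaussian coordinates with power-law variances, noisy labels.
    x = rng.standard_normal((batch_size, feature_scales.size)) * feature_scales
    y = x @ w_star + noise_std * rng.standard_normal(batch_size)
    return x, y

def gradient(w, x, y):
    # Gradient of 0.5 * mean((x @ w - y)^2) with respect to w.
    return x.T @ (x @ w - y) / len(y)

def sgd_step(w, g, lr):
    return w - lr * g

def signsgd_step(w, g, lr):
    # Only the sign of each gradient coordinate is used, so every
    # coordinate with a nonzero gradient moves by exactly lr.
    return w - lr * np.sign(g)

def population_risk(w, feature_scales, w_star):
    # Excess risk 0.5 * sum_j lambda_j * (w_j - w*_j)^2, lambda_j = scale_j^2.
    return 0.5 * np.sum(feature_scales**2 * (w - w_star) ** 2)

def run(step_fn, lr, steps=2000, d=64, alpha=1.5, beta=1.0, seed=0):
    rng = np.random.default_rng(seed)
    scales, w_star = make_power_law_problem(d, alpha, beta)
    w = np.zeros(d)
    for _ in range(steps):
        x, y = sample_batch(rng, scales, w_star, batch_size=8, noise_std=0.1)
        w = step_fn(w, gradient(w, x, y), lr)
    return population_risk(w, scales, w_star)
```

Calling `run(sgd_step, lr=0.1)` and `run(signsgd_step, lr=0.01)` returns the final excess risk under each rule; sweeping `d`, `steps`, and `lr` is one way to probe empirically how the risk scales, although this toy setup makes no claim to reproduce the paper's results.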

Key Takeaways

→ SignSGD exhibits drift-normalization and noise-reshaping effects that standard SGD does not.
→ The noise-reshaping effect can make signSGD's compute-optimal slope steeper than SGD's in regimes where noise dominates.
→ Warmup-stable-decay scheduling further reduces noise and improves compute-optimal scaling when feature decay is fast but target decay is slow.
→ The analysis provides a theoretical framework for understanding when signSGD outperforms SGD in linear regression.
→ The risk scaling depends on model size, number of training steps, learning rate, and both the feature and target decay parameters.
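The warmup-stable-decay schedule mentioned in the takeaways can be sketched as a three-phase learning rate: a ramp up, a constant plateau, and a final decay. The summary does not state the shape or phase lengths the authors use, so the linear ramps, the `warmup_frac` and `decay_frac` defaults, and the function name below are assumptions for illustration only.

```python
def wsd_learning_rate(step, total_steps, peak_lr,
                      warmup_frac=0.1, decay_frac=0.2, final_lr=0.0):
    """Warmup-stable-decay learning rate for a 0-indexed step.

    Linear warmup to peak_lr, constant plateau, then linear decay toward
    final_lr. Fractions and linear shapes are illustrative assumptions.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    warmup_steps = int(round(warmup_frac * total_steps))
    decay_steps = int(round(decay_frac * total_steps))
    decay_start = total_steps - decay_steps

    if step < warmup_steps:
        # Warmup: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak learning rate.
        return peak_lr
    # Decay: interpolate linearly from peak_lr toward final_lr.
    progress = (step - decay_start) / max(decay_steps, 1)
    return peak_lr + (final_lr - peak_lr) * progress
```

With `total_steps=100` and `peak_lr=1.0`, the rate climbs over the first 10 steps, stays at 1.0 through step 79, and falls linearly over the last 20 steps. Lowering the step size late in training shrinks the noise-driven part of the risk, which is the intuition behind the noise reduction the summary attributes to this schedule.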