StoSignSGD: Unbiased Structural Stochasticity Fixes SignSGD for Training Large Language Models
Researchers introduce StoSignSGD, a novel optimization algorithm that fixes convergence issues in SignSGD by injecting structural stochasticity while keeping updates unbiased. The algorithm demonstrates a 1.44x to 2.14x speedup in low-precision FP8 LLM pretraining, a regime where AdamW fails, and outperforms existing optimizers on mathematical reasoning fine-tuning tasks.
StoSignSGD addresses a fundamental limitation in sign-based optimization algorithms widely used for training large language models. SignSGD's inability to converge on non-smooth objectives—commonplace in modern architectures with ReLUs, max-pools, and mixture-of-experts—has constrained its applicability despite superior empirical performance in distributed settings. This research bridges that gap through structural stochasticity injection, enabling convergence guarantees across convex and non-convex optimization landscapes.
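The paper does not spell out its operator here, but one standard way to inject stochasticity into a sign update while keeping it unbiased is to randomize each coordinate's sign with a probability tied to the gradient's magnitude. The sketch below illustrates that general idea; the `stochastic_sign` function, its scaling by the max magnitude `M`, and all names are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def stochastic_sign(g, rng):
    """Map a gradient g to a random ±M vector whose expectation equals g.

    With M = max|g_i|, each coordinate is set to +M with probability
    (1 + g_i / M) / 2 and to -M otherwise, so E[output_i] = g_i.
    This is one textbook unbiased sign construction, shown only to
    illustrate "structural stochasticity with unbiased updates".
    """
    m = np.max(np.abs(g))
    if m == 0.0:
        return np.zeros_like(g)
    p_plus = 0.5 * (1.0 + g / m)  # per-coordinate probability of +M
    signs = np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)
    return m * signs

rng = np.random.default_rng(0)
g = np.array([0.3, -1.2, 0.0, 0.7])
# Averaging many draws recovers g, confirming the estimator is unbiased.
est = np.mean([stochastic_sign(g, rng) for _ in range(100000)], axis=0)
```

Because every coordinate of the output has the same magnitude, the update communicates and quantizes like a plain sign step, yet its expectation matches the true gradient.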
The algorithm's theoretical contributions are substantial. For convex optimization, StoSignSGD achieves convergence rates matching information-theoretic lower bounds. For non-convex non-smooth problems, the researchers introduce generalized stationary measures and prove improvements over existing complexity bounds by dimensional factors, suggesting the approach addresses deeper algorithmic limitations rather than offering incremental gains.
Practical implications center on efficient large model training. The 1.44x to 2.14x speedup in FP8 pretraining is particularly significant because low-precision computation directly reduces memory consumption and hardware costs—critical bottlenecks in foundation model development. AdamW's catastrophic failure in this regime versus StoSignSGD's stability suggests a potential paradigm shift for resource-constrained training. The gains in mathematical reasoning fine-tuning indicate benefits extend beyond computational efficiency to model quality.
The sign conversion framework enabling optimizer transformation adds methodological value beyond this specific algorithm, potentially influencing future optimizer design. For the AI infrastructure and model training communities, this work demonstrates that theoretical rigor and empirical efficiency aren't mutually exclusive in optimization research. Practitioners training large models on budget-constrained systems should monitor implementation availability and adoption rates.
- StoSignSGD resolves SignSGD's non-convergence on non-smooth objectives through structural stochasticity while maintaining unbiased updates
- Achieves 1.44x to 2.14x speedup in FP8 pretraining where AdamW fails catastrophically
- Provides theoretical convergence guarantees for both convex and non-convex non-smooth optimization with improved complexity bounds
- Outperforms AdamW and SignSGD on mathematical reasoning fine-tuning tasks for 7B LLMs
- Introduces sign conversion framework enabling transformation of any optimizer into unbiased sign-based variant
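The sign conversion framework described above could, in principle, be pictured as wrapping any base optimizer's raw update with an unbiased stochastic sign operator. The sketch below shows that shape on plain gradient descent for a toy quadratic; the `sign_convert` wrapper, the magnitude-proportional sign operator, and the toy problem are all hypothetical illustrations, not the paper's implementation:

```python
import numpy as np

def stochastic_sign(g, rng):
    # Illustrative unbiased sign operator: each coordinate becomes +M or -M
    # (M = max|g_i|) with probabilities chosen so the expectation equals g.
    m = np.max(np.abs(g))
    if m == 0.0:
        return np.zeros_like(g)
    p_plus = 0.5 * (1.0 + g / m)
    return m * np.where(rng.random(g.shape) < p_plus, 1.0, -1.0)

def sign_convert(update_fn):
    """Wrap a base optimizer's update rule so its direction is passed through
    the stochastic sign, yielding a sign-based variant that remains unbiased
    in expectation. `update_fn` is a hypothetical callable (params, grad) -> update."""
    def wrapped(params, grad, rng):
        return stochastic_sign(update_fn(params, grad), rng)
    return wrapped

# Demo: sign-converted gradient descent on f(x) = 0.5 * ||x||^2.
rng = np.random.default_rng(1)
sgd_update = lambda params, grad: grad  # plain SGD direction
sign_sgd = sign_convert(sgd_update)

x = np.array([2.0, -3.0])
for _ in range(500):
    grad = x  # gradient of f(x) = 0.5 * ||x||^2 is x itself
    x = x - 0.02 * sign_sgd(x, grad, rng)
```

Because the wrapped update equals the base update in expectation, the converted optimizer inherits the base method's average trajectory while emitting only sign-structured (equal-magnitude) steps.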