Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network
Researchers propose a continuous-time mathematical model for analyzing gradient descent dynamics in the Edge of Stability regime, where large learning rates cause oscillations in neural network training. The model introduces an effective free energy that combines the risk with a curvature-related term. It is used to predict training dynamics in wide two-layer networks and is validated on matrix factorization and CIFAR-10 tasks.
This theoretical work addresses a fundamental challenge in deep learning optimization: understanding the behavior of gradient descent when the learning rate is large enough to cause persistent oscillations. The Edge of Stability regime, in which the iterates oscillate without diverging, has been observed empirically in modern neural network training, but a rigorous mathematical characterization has been lacking.
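The role of the learning rate can be illustrated on a one-dimensional quadratic, where gradient descent is stable only when the learning rate is below 2 divided by the curvature. The sketch below is a toy illustration of that classical threshold, not the paper's model: the function name `gd_on_quadratic` and all constants are hypothetical, and a pure quadratic cannot reproduce the self-stabilizing oscillations of the actual Edge of Stability regime, which depend on the curvature changing during training.

```python
def gd_on_quadratic(curvature, lr, steps=20, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2.

    Each step multiplies x by (1 - lr * curvature), so the iterates
    shrink when lr < 2 / curvature and grow when lr > 2 / curvature.
    """
    xs = [x0]
    for _ in range(steps):
        grad = curvature * xs[-1]      # f'(x) = curvature * x
        xs.append(xs[-1] - lr * grad)  # plain gradient descent update
    return xs


curvature = 4.0                 # hypothetical sharpness of the toy loss
threshold = 2.0 / curvature     # classical stability limit for the step size

below = gd_on_quadratic(curvature, lr=0.9 * threshold)  # decaying oscillation
above = gd_on_quadratic(curvature, lr=1.1 * threshold)  # growing oscillation

print("final |x| below threshold:", abs(below[-1]))
print("final |x| above threshold:", abs(above[-1]))
```

In both runs the iterates flip sign at every step; only the amplitude differs. Tracking how such an amplitude (the oscillation envelope) evolves while the underlying weights also move is the problem the effective free energy is meant to address.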
The work builds on recent advances in understanding non-equilibrium dynamics in machine learning. Previous research identified that large learning rates can induce beneficial oscillatory behavior, but tracking the envelope of these oscillations while the weights evolve remained an open problem. The researchers' contribution is an effective free energy functional that captures both the original loss landscape and entropic contributions from weight fluctuations, providing a unified framework for analysis.
These results have implications for practitioners designing neural network training pipelines. Because the effective free energy predicts oscillation envelopes, the framework could inform hyperparameter selection and learning rate scheduling. Understanding when and why oscillations occur could enable more principled approaches to setting learning rates than purely empirical tuning.
The kinetic equation derived for wide networks provides a bridge between microscopic (individual weight) and macroscopic (population-level) descriptions, connecting to Wasserstein gradient flows. Future work may extend these theoretical insights to deeper architectures and alternative optimization algorithms, potentially influencing how practitioners approach neural network training at scale.
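For orientation, a Wasserstein-2 gradient flow of a generic functional has the standard continuity-equation form shown below. This is the textbook template only: the symbols $\rho_t$ (the distribution of per-neuron parameters $\theta$ at time $t$) and $\mathcal{F}$ (a generic energy functional) are notation assumed here, and the specific free energy used in the paper is not given in this summary.

```latex
% Generic Wasserstein-2 gradient flow of a functional F[rho].
% rho_t : probability density over parameters theta at time t (assumed notation)
% F     : a generic energy functional, NOT the paper's specific free energy
\partial_t \rho_t
  = \nabla_\theta \cdot \left( \rho_t \, \nabla_\theta \frac{\delta \mathcal{F}}{\delta \rho}[\rho_t] \right)
```

In this picture each particle (here, a neuron's weights) moves along the negative gradient of the first variation $\delta \mathcal{F} / \delta \rho$, which is what links the individual-weight and population-level descriptions.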
- A free energy model successfully tracks oscillation dynamics in gradient descent when learning rates are large enough to induce instability.
- The framework applies to two-layer neural networks and has been validated on both synthetic matrix factorization and real-world CIFAR-10 benchmarks.
- The derived kinetic equation describes joint evolution of weights and their fluctuations as a Wasserstein-2 gradient flow.
- The model enables prediction of training loss spikes and envelope behavior without requiring knowledge of fine-grained oscillation details.
- These theoretical results provide new mathematical foundations for understanding and potentially optimizing neural network training with aggressive learning rates.