
Gradient descent at the Edge of Stability: free energy model and kinetic description of the two-layer network

arXiv – CS AI | Antonin Chodron de Courcel
AI Summary

Researchers propose a continuous-time mathematical model of gradient descent dynamics in the Edge of Stability regime, where large learning rates cause oscillations in neural network training. The model introduces an effective free energy that combines the risk with a curvature-related term, which enables better prediction of training dynamics in wide two-layer networks. It is validated on matrix factorization and CIFAR-10 tasks.

Analysis

This theoretical research addresses a fundamental challenge in deep learning optimization: understanding the behavior of gradient descent when learning rates are set aggressively enough to cause persistent oscillations. The Edge of Stability regime, in which training oscillates without diverging, has been observed empirically in modern neural network training, but a rigorous mathematical characterization has been lacking.
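As a minimal illustration of the instability threshold behind this regime (a toy sketch, not the paper's model), consider gradient descent on the one-dimensional quadratic L(w) = 0.5 * curvature * w^2. Each step multiplies w by (1 - lr * curvature), so the iterates decay monotonically when lr * curvature < 1, oscillate with shrinking amplitude when it lies between 1 and 2, and diverge once it exceeds 2. The function names and default values below are hypothetical and chosen only for this example.

```python
def gd_on_quadratic(curvature, lr, w0=1.0, steps=20):
    """Run gradient descent on L(w) = 0.5 * curvature * w**2.

    Returns the list of iterates, starting with w0.
    This toy loss is an assumption made for illustration only.
    """
    w = w0
    trajectory = [w]
    for _ in range(steps):
        grad = curvature * w      # dL/dw for the quadratic
        w = w - lr * grad         # plain gradient descent step
        trajectory.append(w)
    return trajectory


def stability_threshold(lr):
    """Largest curvature for which GD on a quadratic does not diverge: 2 / lr."""
    return 2.0 / lr


def changes_sign_every_step(trajectory):
    """True if consecutive iterates always have opposite signs (oscillation)."""
    return all(a * b < 0 for a, b in zip(trajectory, trajectory[1:]))
```

Calling `gd_on_quadratic(1.0, 1.5)` gives a damped oscillation, while `gd_on_quadratic(1.0, 2.5)` gives a growing one. In a real network the curvature changes during training, which is what allows oscillations to persist without diverging; this sketch holds it fixed and so cannot show that effect.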

The work builds on recent advances in understanding non-equilibrium dynamics in machine learning. Previous research identified that large learning rates could induce beneficial oscillatory behavior, but tracking the envelope of these oscillations while weights simultaneously evolve remained unsolved. The researchers' innovation lies in introducing an effective free energy functional that captures both the original loss landscape and entropic contributions from weight fluctuations, providing a unified framework for analysis.

The result has implications for practitioners designing neural network training pipelines. Because the effective free energy predicts the envelope of the oscillations, the framework could inform hyperparameter selection and learning rate scheduling. Understanding when and why oscillations occur could enable more principled approaches to setting learning rates rather than relying purely on empirical tuning.

The kinetic equation derived for wide networks provides a bridge between microscopic (individual weight) and macroscopic (population-level) descriptions, connecting to Wasserstein gradient flows. Future work may extend these theoretical insights to deeper architectures and alternative optimization algorithms, potentially influencing how practitioners approach neural network training at scale.
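For reference, a Wasserstein-2 gradient flow of a functional over probability densities takes the standard form sketched below. The notation is an assumption made here: \rho_t stands for the population density of per-neuron states \theta, and \mathcal{F} is a placeholder for the effective free energy, whose exact expression is not given in this summary.

```latex
% Generic Wasserstein-2 gradient flow of a functional F[rho].
% F is a placeholder; the paper's specific free energy is not reproduced here.
\partial_t \rho_t(\theta)
  = \nabla_\theta \cdot \left( \rho_t(\theta)\,
      \nabla_\theta \frac{\delta \mathcal{F}}{\delta \rho}[\rho_t](\theta) \right)
```

In this form each particle moves along the negative gradient of the first variation \delta\mathcal{F}/\delta\rho, which is how a description of individual weights turns into an evolution equation for the whole population.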

Key Takeaways
  • A free energy model successfully tracks oscillation dynamics in gradient descent when learning rates are large enough to induce instability.
  • The framework applies to two-layer neural networks and has been validated on both synthetic matrix factorization and real-world CIFAR-10 benchmarks.
  • The derived kinetic equation describes joint evolution of weights and their fluctuations as a Wasserstein-2 gradient flow.
  • The model enables prediction of training loss spikes and envelope behavior without requiring knowledge of fine-grained oscillation details.
  • These theoretical results provide new mathematical foundations for understanding and potentially optimizing neural network training with aggressive learning rates.