Gradient Descent with Large Step Size Restores Symmetry in Deep Linear Networks with Multi-Pathway
Researchers demonstrate that discrete Gradient Descent with large step sizes produces fundamentally different training dynamics in deep linear networks compared to continuous Gradient Flow. Their analysis reveals that multi-pathway networks redistribute signals across pathways during later training stages rather than concentrating them in single pathways, challenging prevailing theoretical predictions and suggesting that optimization step size significantly influences neural network representation learning.
This theoretical work addresses a disconnect between continuous-time and discrete-time optimization in deep learning. Gradient Flow analysis predicts 'winner-takes-all' specialization, in which each feature concentrates in a single pathway; the authors show that discrete Gradient Descent with appropriately large step sizes produces the opposite behavior. The key insight concerns sharpness: single-pathway solutions sit at sharp minima, while representations distributed across multiple pathways reduce sharpness, and the gap grows with network depth and pathway count. This matters because large-step Gradient Descent gravitates toward flatter minima through oscillations at the Edge of Stability, a phenomenon absent from continuous-time Gradient Flow analysis. In doing so, the research narrows the gap between theoretical predictions and practical neural network behavior.

For deep learning practitioners, the result suggests that architectural depth and optimization hyperparameters jointly determine how networks organize learned representations. Rather than converging to specialized single-pathway solutions, appropriately tuned discrete optimization drives networks toward shared representations distributed across pathways. This may help explain why over-parameterized neural networks generalize well: shared representations may provide better regularization than specialized pathways. More broadly, the work emphasizes that continuous approximations, while mathematically elegant, can miss phenomena that govern real neural network training, particularly the step-size effects and stability dynamics that shape the final network structure.
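The sharpness argument can be made concrete with a toy calculation. The sketch below is not the authors' setup: it assumes a scalar multi-pathway deep linear network whose output is the sum, over pathways, of the product of that pathway's weights, trained with squared loss toward a hypothetical target signal of 4. All function names, the balanced-weight construction, and the step size 0.1 are illustrative assumptions. At a zero-loss point the Hessian of the loss reduces to the outer product of the output gradient with itself, so the sharpness (largest Hessian eigenvalue) is the squared gradient norm, and the standard stability condition for Gradient Descent on a quadratic is sharpness < 2 / step size.

```python
import math


def balanced_weights(pathway_signals, depth):
    """Hypothetical construction: each pathway carries a non-negative signal a,
    split evenly so that every one of its `depth` layers has weight a**(1/depth)."""
    return [[a ** (1.0 / depth)] * depth for a in pathway_signals]


def pathway_output(weights):
    """Toy network output: sum over pathways of the product of layer weights."""
    return sum(math.prod(path) for path in weights)


def sharpness_at_zero_loss(weights):
    """Largest Hessian eigenvalue of 0.5 * (f - s)**2 at a point where f == s.

    There the Hessian equals g g^T with g = grad f, so the eigenvalue is ||g||^2.
    The partial derivative of f with respect to one weight is the product of the
    other weights in the same pathway.
    """
    squared_norm = 0.0
    for path in weights:
        for layer in range(len(path)):
            partial = math.prod(w for k, w in enumerate(path) if k != layer)
            squared_norm += partial ** 2
    return squared_norm


def is_gd_stable(sharpness, step_size):
    """Linear stability of Gradient Descent near a minimum: sharpness < 2 / step_size."""
    return sharpness < 2.0 / step_size


if __name__ == "__main__":
    depth, step_size = 4, 0.1  # illustrative values, not taken from the paper
    single = balanced_weights([4.0, 0.0, 0.0, 0.0], depth)  # winner-takes-all
    shared = balanced_weights([1.0, 1.0, 1.0, 1.0], depth)  # evenly distributed
    for name, weights in (("single", single), ("shared", shared)):
        sharpness = sharpness_at_zero_loss(weights)
        print(name, pathway_output(weights), sharpness, is_gd_stable(sharpness, step_size))
```

Under these assumptions both configurations fit the target, but the single-pathway solution has roughly twice the sharpness of the shared one, so a step size of 0.1 is stable only at the shared solution. In this toy model the ratio works out to P^(1 - 2/L) for P pathways and depth L: it vanishes at depth 2 and grows with both depth and pathway count, in line with the scaling described above.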
- Large-step Gradient Descent creates network dynamics fundamentally different from theoretical Gradient Flow predictions in multi-pathway deep linear networks
- Single-pathway solutions form sharp minima while distributed representations reduce sharpness, with this effect scaling with network depth
- Edge of Stability oscillations drive networks toward re-balancing phases where signals redistribute across pathways rather than concentrating
- Discrete optimization step size selection significantly influences whether networks develop specialized or shared representations
- Continuous-time theoretical analyses may miss critical optimization phenomena relevant to practical neural network training dynamics