It's Not a Lottery, It's a Race: Understanding How Gradient Descent Adapts the Network's Capacity to the Task
Researchers have identified three fundamental dynamical principles (mutual alignment, unlocking, and racing) that explain how gradient descent training reduces a network's capacity to match the requirements of the task. This theoretical advance clarifies the mechanisms behind the lottery ticket hypothesis, explains why neurons with favorable initial conditions end up with larger weight norms, and narrows a significant gap between the empirical success of neural networks and their theoretical understanding.
This research addresses a critical gap in deep learning theory by explaining how neural networks naturally compress their capacity during training. The findings come from analyzing single-hidden-layer ReLU networks at the level of individual neurons, where gradient descent operates through three coordinated mechanisms: mutual alignment drives neurons to specialize toward task-relevant features, unlocking lets neurons escape their initial constraints, and racing creates a competition in which the most useful neurons accumulate the largest weight norms. Together, these principles explain why post-training pruning and neuron merging often succeed without loss of performance.
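To make the neuron-level picture concrete, here is a minimal sketch (assuming a toy teacher-student setup, plain full-batch gradient descent, and illustrative hyperparameters; none of this is taken from the paper). It trains an over-parameterized single-hidden-layer ReLU network from a small initialization and then inspects the per-neuron norm, which is the natural quantity to watch if the racing intuition is right: a few neurons should grow large while most stay near their initialization.

```python
# Sketch only: toy teacher-student task, full-batch gradient descent, NumPy.
import numpy as np

rng = np.random.default_rng(0)
d, n_teacher, n_student, n_samples = 10, 3, 50, 500

# Teacher: a sparse single-hidden-layer ReLU network that defines the task.
V = rng.standard_normal((n_teacher, d))
b = rng.standard_normal(n_teacher)
X = rng.standard_normal((n_samples, d))
y = np.maximum(X @ V.T, 0.0) @ b

# Student: over-parameterized, initialized with small weights.
W = 0.01 * rng.standard_normal((n_student, d))   # hidden weights w_j
a = 0.01 * rng.standard_normal(n_student)        # output weights a_j

lr, steps = 0.05, 6000
for _ in range(steps):
    pre = X @ W.T                       # pre-activations, shape (n_samples, n_student)
    h = np.maximum(pre, 0.0)            # ReLU activations
    err = h @ a - y                     # residual f(x) - y
    # Gradients of the mean-squared error (1/2) * mean((f - y)**2).
    grad_a = h.T @ err / n_samples                                  # dL/da_j
    grad_W = ((err[:, None] * (pre > 0.0)) * a).T @ X / n_samples   # dL/dw_j
    a -= lr * grad_a
    W -= lr * grad_W

# "Racing": the per-neuron norm |a_j| * ||w_j|| separates winners from the rest.
norms = np.abs(a) * np.linalg.norm(W, axis=1)
print("five largest neuron norms:", np.round(np.sort(norms)[-5:], 3))
print("fraction of neurons below 1% of the max norm:",
      np.mean(norms < 0.01 * norms.max()))
```

How cleanly the norms separate depends on the seed and hyperparameters; the point of the sketch is only that per-neuron weight norms, not individual parameters, are the right resolution at which to observe the capacity the network actually ends up using.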
The theoretical framework validates and mechanistically explains the lottery ticket hypothesis, the previously empirical observation that randomly initialized networks contain subnetworks that can be trained to match the full network's performance. Rather than attributing this to random luck, the research demonstrates that gradient descent deterministically identifies and amplifies advantageous initial conditions through the racing mechanism. This understanding has direct implications for neural network design and efficiency.
For practitioners and researchers, these insights suggest that capacity reduction is not merely a post-hoc optimization technique but an inherent property of gradient descent dynamics. This could enable more principled approaches to model compression, sparse training, and architecture design, and it may also inform transfer learning and few-shot learning by clarifying how networks adapt their capacity to novel tasks.
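If capacity reduction really is a property of the training dynamics, one practical consequence is that structured, norm-based pruning after training should be close to lossless. The sketch below (with an illustrative pruning rule and threshold, and a synthetic "trained" network standing in for a real one) drops hidden neurons whose combined norm is negligible and checks that the pruned network computes nearly the same function.

```python
# Sketch only: norm-based structured pruning of a single-hidden-layer ReLU network.
import numpy as np

def prune_by_neuron_norm(W, a, rel_threshold=0.01):
    """Remove hidden neurons whose norm is below rel_threshold * max norm.

    W: (n_hidden, d) input-to-hidden weights; a: (n_hidden,) hidden-to-output weights.
    Returns a pruned (W, a) pair describing the same function up to the
    contribution of the discarded low-norm neurons.
    """
    norms = np.abs(a) * np.linalg.norm(W, axis=1)
    keep = norms >= rel_threshold * norms.max()
    return W[keep], a[keep]

# Tiny demo with a synthetic "trained" network: a few large neurons, many tiny ones.
rng = np.random.default_rng(1)
W = np.vstack([rng.standard_normal((3, 5)), 1e-3 * rng.standard_normal((47, 5))])
a = np.concatenate([rng.standard_normal(3), 1e-3 * rng.standard_normal(47)])
W_p, a_p = prune_by_neuron_norm(W, a)
print(f"kept {len(a_p)} of {len(a)} neurons")

# The pruned network agrees closely with the original on random inputs.
X = rng.standard_normal((4, 5))
full = np.maximum(X @ W.T, 0.0) @ a
pruned = np.maximum(X @ W_p.T, 0.0) @ a_p
print("max output difference:", np.abs(full - pruned).max())
```

The design choice here is to prune whole neurons by the product |a_j| * ||w_j|| rather than individual weights by magnitude, because that product is exactly the per-neuron quantity the racing picture says gradient descent drives toward zero for task-irrelevant units.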
Future work should extend these principles to deeper networks and more complex architectures. Understanding capacity adaptation mechanisms may reveal opportunities for more sample-efficient training, tighter generalization bounds, and new regularization techniques that work with, rather than against, the natural dynamics of gradient descent.
- Gradient descent reduces neural network capacity through three mechanisms that work together during training: mutual alignment, unlocking, and racing.
- The lottery ticket hypothesis is mechanistically explained by the racing principle, in which neurons with beneficial initial conditions deterministically end up with larger weight norms.
- Post-training pruning and neuron merging succeed because gradient descent naturally compresses networks down to task-relevant subnetworks.
- Understanding capacity adaptation dynamics could enable more efficient model compression and architecture design strategies.
- The neuron-level analysis narrows the gap between the empirical success of neural networks and their theoretical understanding.