Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction
Researchers formalize the grokking phenomenon, in which neural networks fit training data quickly but learn generalizable rules slowly, by analyzing deep linear networks and ReLU MLPs. The study identifies two distinct training timescales, a fast decay of the classification loss and a slower simplification of the representation, with implications for understanding how neural networks generalize.
This theoretical machine learning paper addresses grokking, a phenomenon in neural network training where models generalize long after they have fit the training data. The researchers make this picture precise by separating two temporal dynamics: rapid loss minimization and gradual representation learning. Their framework builds on deep linear network theory and proves that, under certain conditions (post-margin gap-growth or tail-contraction), the classification loss decays logarithmically while weight decay induces a Schatten-type regularization that operates on a polynomial timescale.
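One way to read "Schatten-type regularization" is through a standard identity from the deep linear network literature. Whether the paper uses exactly this form is an assumption here, and the symbols below are introduced only for illustration. For a depth-L linear network with end-to-end matrix W = W_L ... W_1 and hidden widths at least rank(W), the smallest weight-decay penalty that realizes a given W is a Schatten quasi-norm of W:

```latex
\min_{W_L \cdots W_1 = W} \; \sum_{i=1}^{L} \|W_i\|_F^2
  \;=\; L \sum_{j} \sigma_j(W)^{2/L}
  \;=\; L \, \|W\|_{S_{2/L}}^{2/L}
```

Here the sigma_j(W) are the singular values of W. For L = 2 the right-hand side is twice the nuclear norm of W, so weight decay on the individual layers acts like a low-rank-promoting penalty on the product.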
The work builds on a growing recognition that neural network training involves multiple interacting processes operating at different speeds. Previous observations of grokking in modular arithmetic tasks have lacked a formal explanation. This analysis bridges that gap by showing that, within a region where the activation pattern is fixed, a ReLU network reduces to a linear model, which makes it possible to analyze why classifiers learn faster than embeddings.
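The reduction of a ReLU network to a linear model inside a fixed activation region can be checked numerically. The sketch below is a minimal illustration, not the paper's construction: it assumes a two-layer MLP with random weights, and every name in it (`relu_mlp`, `activation_pattern`, `effective_affine`) is hypothetical. With biases included, the reduced model is affine: with D the 0/1 diagonal mask of active hidden units, the network equals A x + c with A = W2 D W1 and c = W2 D b1 + b2 for every input that shares the same mask.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def preactivations(W1, b1, x):
    return [p + b for p, b in zip(matvec(W1, x), b1)]

def relu_mlp(W1, b1, W2, b2, x):
    """Two-layer ReLU MLP: W2 relu(W1 x + b1) + b2."""
    hidden = [max(0.0, p) for p in preactivations(W1, b1, x)]
    return [o + b for o, b in zip(matvec(W2, hidden), b2)]

def activation_pattern(W1, b1, x):
    """0/1 mask recording which hidden units are active at x."""
    return [1.0 if p > 0 else 0.0 for p in preactivations(W1, b1, x)]

def effective_affine(W1, b1, W2, b2, pattern):
    """Return (A, c) with A = W2 D W1 and c = W2 D b1 + b2, D = diag(pattern)."""
    masked_W1 = [[d * w for w in row] for d, row in zip(pattern, W1)]
    masked_b1 = [d * b for d, b in zip(pattern, b1)]
    cols = list(zip(*masked_W1))
    A = [[sum(w * m for w, m in zip(row, col)) for col in cols] for row in W2]
    c = [s + b for s, b in zip(matvec(W2, masked_b1), b2)]
    return A, c

# Illustrative sizes and random weights (assumptions, not from the paper).
rng = random.Random(0)
n_in, n_hid, n_out = 3, 5, 2
W1 = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_hid)]
b1 = [rng.gauss(0, 1) for _ in range(n_hid)]
W2 = [[rng.gauss(0, 1) for _ in range(n_hid)] for _ in range(n_out)]
b2 = [rng.gauss(0, 1) for _ in range(n_out)]

x0 = [rng.gauss(0, 1) for _ in range(n_in)]
pattern0 = activation_pattern(W1, b1, x0)
A, c = effective_affine(W1, b1, W2, b2, pattern0)

# Compare the network with the affine model on points sharing x0's pattern.
n_checked, max_err = 0, 0.0
for k in range(200):
    x = x0 if k == 0 else [xi + rng.uniform(-1e-2, 1e-2) for xi in x0]
    if activation_pattern(W1, b1, x) != pattern0:
        continue  # left the activation region; the reduction no longer applies
    y_net = relu_mlp(W1, b1, W2, b2, x)
    y_lin = [a + ci for a, ci in zip(matvec(A, x), c)]
    max_err = max(max_err, max(abs(u - v) for u, v in zip(y_net, y_lin)))
    n_checked += 1

print(n_checked, max_err)
```

The reduction is conditional in the sense the article describes: points whose activation pattern differs from that of `x0` are skipped, because a different mask gives a different affine map.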
For the machine learning community, this research clarifies why simply minimizing training loss doesn't guarantee representation quality or generalization. The conditional reduction to linear dynamics in ReLU networks suggests that understanding network behavior requires examining local geometry rather than global dynamics. This has practical implications for training strategies: practitioners might achieve better generalization by explicitly controlling representation simplification rather than focusing solely on loss minimization.
Future work should extend these conditional results toward global proofs for nonlinear networks and validate whether insights from modular arithmetic transfer to realistic datasets and architectures. Understanding these training clocks could inform better optimization algorithms and regularization schemes.
- Grokking separates into two timescales under weight decay: fast loss decay (logarithmic) and slow representation learning (polynomial).
- Deep linear network theory provides a rigorous mathematical foundation for analyzing when and why representation simplification lags behind loss minimization.
- ReLU networks can be analyzed via a conditional reduction to linear models in regions with fixed activation patterns.
- Classifier heads receive larger effective gradients than embedding blocks, which explains the two-stage learning mechanism.
- The results use modular addition as the experimental setting, but the theory is meant to inform the broader understanding of neural network generalization.
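The separation between a fast fitting clock and a slow simplification clock can be seen in a very small toy model. The sketch below is an illustration under strong assumptions, not the paper's setting, and it does not reproduce the paper's rates. It trains a scalar two-layer linear model f(x) = a * b * x to fit the target 1 with plain gradient descent and L2 weight decay, starting from a deliberately unbalanced initialization. The function name, hyperparameters, and thresholds are all hypothetical choices.

```python
def run_two_clock_toy(lr=0.01, weight_decay=0.01, steps=30000, a=4.0, b=0.0):
    """Gradient descent on 0.5*(a*b - 1)**2 + 0.5*weight_decay*(a**2 + b**2).

    Returns the first step at which the fit error is small (fit_step), the
    first step at which the squared weight norm is small (simplify_step),
    and the final weights.
    """
    fit_step = None
    simplify_step = None
    for t in range(1, steps + 1):
        r = a * b - 1.0                      # residual of the fit
        grad_a = r * b + weight_decay * a
        grad_b = r * a + weight_decay * b
        a, b = a - lr * grad_a, b - lr * grad_b

        r = a * b - 1.0
        if fit_step is None and r * r < 1e-3:
            fit_step = t                     # fast clock: the data is fit
        if simplify_step is None and a * a + b * b < 2.5:
            simplify_step = t                # slow clock: weights are balanced
    return fit_step, simplify_step, a, b

fit_step, simplify_step, a_final, b_final = run_two_clock_toy()
print("fit reached at step", fit_step)
print("small-norm solution reached at step", simplify_step)
```

In this toy, the fit error shrinks at a rate set by the learning rate and the size of the large weight. The imbalance a^2 - b^2 shrinks only through weight decay, by a factor of roughly (1 - lr * weight_decay)^2 per step, so the low-norm solution arrives far later than the fit.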