Spectral Dynamics in Deep Networks: Feature Learning, Outlier Escape, and Learning Rate Transfer
Researchers develop a dynamical mean-field theory (DMFT) framework to analyze how neural network weight spectra evolve during training, revealing that different parameterization schemes (μP vs. NTK) produce fundamentally different outlier dynamics. The findings suggest that neural scaling laws and hyperparameter transfer depend critically on how outlier eigenvalues behave, with implications for understanding generalization and optimization in deep learning.
This theoretical work addresses a fundamental gap in understanding neural network training dynamics by tracking bulk and outlier spectral behavior simultaneously. Traditional analyses treat these components separately, but the authors' two-level DMFT framework reveals that they interact in ways that significantly affect learning. The analysis covers both infinite-width nonlinear networks and deep linear networks, showing that μP parameterization achieves width-consistent outlier dynamics while standard NTK parameterization fails this consistency test despite converging asymptotically.
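The bulk+outlier decomposition can be made concrete with a small numerical sketch (an illustration, not the authors' code): for a random weight matrix with i.i.d. entries, the Marchenko-Pastur law fixes the edge of the bulk spectrum, and a sufficiently strong low-rank update, standing in for feature learning, pushes an eigenvalue past that edge. The matrix sizes, the rank-one spike strength, and the 1.05 tolerance below are arbitrary demo choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def spectrum(W):
    """Eigenvalues of the correlation matrix W W^T / m for an n x m matrix."""
    n, m = W.shape
    return np.linalg.eigvalsh(W @ W.T / m)   # ascending order

n, m = 512, 2048                 # width n, fan-in m
sigma2 = 1.0                     # per-entry variance at initialization
W0 = rng.normal(0.0, np.sqrt(sigma2), size=(n, m))

# Marchenko-Pastur bulk edge for W W^T / m at aspect ratio q = n / m:
# lambda_+ = sigma^2 (1 + sqrt(q))^2. Eigenvalues above the edge (plus a
# small tolerance for finite-size fluctuations) are flagged as outliers.
edge = sigma2 * (1 + np.sqrt(n / m)) ** 2

# A rank-one spike stands in for the low-rank component that feature
# learning adds to the weights during training.
u = rng.normal(size=(n, 1)); u /= np.linalg.norm(u)
v = rng.normal(size=(m, 1)); v /= np.linalg.norm(v)
W = W0 + 100.0 * u @ v.T

for name, mat in [("init", W0), ("after update", W)]:
    lam = spectrum(mat)
    print(f"{name:>12}: MP edge {edge:.2f}, top eigenvalue {lam[-1]:.2f}, "
          f"outliers: {np.sum(lam > 1.05 * edge)}")
```

At initialization the top eigenvalue sits at the bulk edge and nothing is flagged; after the spike is added, one eigenvalue escapes the bulk. This is the elementary static picture that the paper's DMFT equations turn into a dynamical prediction.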
The theoretical contributions build on decades of mean-field methods from statistical physics, here applied to machine learning. Recent advances in understanding neural tangent kernels and neural scaling laws created demand for a sharper characterization of spectral phenomena. This work fills that gap by providing predictive equations for how outlier eigenvalues evolve with training time, initialization, output scale, and network width, factors previously treated as black boxes.
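To give a sense of what a predictive outlier equation looks like, the classical two-layer linear-network result (Saxe et al., 2014) is a useful reference point; it is a textbook special case, not the paper's DMFT equations. Under gradient flow on whitened data with balanced initialization, each singular-value mode u(t) of the end-to-end map W2 W1 evolves independently toward its target strength s:

```latex
% Sigmoidal mode dynamics of a two-layer linear network (Saxe et al., 2014):
% u(t) is the strength of one singular mode of W_2 W_1, s its target value,
% tau the gradient-flow time constant, u_0 = u(0) the initialization.
\tau \frac{du}{dt} = 2\,u\,(s - u)
\quad\Longrightarrow\quad
u(t) = \frac{s}{1 + \left(s/u_0 - 1\right) e^{-2 s t / \tau}} .
```

Small initializations u0 << s produce a long plateau followed by a rapid escape at t ≈ (τ/2s) ln(s/u0), exactly the kind of dependence on training time and initialization that the DMFT framework is built to predict in more general settings.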
For practitioners training large models, these insights help explain the empirical success of μP scaling rules in modern deep learning: the framework shows why certain hyperparameter choices transfer across model scales while others fail dramatically. However, the analysis also reveals a fundamental limitation: tasks with many output channels (e.g., ImageNet classification or GPT-style language modeling) exhibit spectral bulk restructuring that defies the bulk+outlier picture, suggesting that additional mechanisms govern feature learning in realistic settings.
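The width-consistency claim can be illustrated with a toy experiment (a sketch under assumptions, not the paper's setup). One standard way to write the two parameterizations for a two-layer network trained by gradient descent is: NTK scales the output by 1/sqrt(n) with an O(1) learning rate, while mean-field/μP scales the output by 1/n and the learning rate by n. Tracking the top singular value of W / sqrt(n), a width-comparable quantity, the μP column below should change by a roughly width-independent amount while the NTK column's change shrinks with width; the exact numbers depend on the seed, task, and step size.

```python
import numpy as np

rng = np.random.default_rng(1)
d, batch, steps, eta = 16, 256, 200, 0.05
X = rng.normal(size=(batch, d)) / np.sqrt(d)     # O(1)-norm inputs
y = np.tanh(X @ rng.normal(size=d))              # scalar regression targets

def top_sv(W):
    # Top singular value of W / sqrt(width): an O(1), width-comparable scale.
    return np.linalg.svd(W / np.sqrt(W.shape[0]), compute_uv=False)[0]

def train(width, param):
    """f(x) = scale * a . tanh(W x); full-batch gradient descent on MSE."""
    W = rng.normal(size=(width, d))
    a = rng.normal(size=width)
    if param == "ntk":                    # output scale 1/sqrt(n), lr O(1)
        scale, lr = 1 / np.sqrt(width), eta
    else:                                 # mean-field / muP: scale 1/n, lr O(n)
        scale, lr = 1 / width, eta * width
    s0 = top_sv(W)
    for _ in range(steps):
        h = np.tanh(W @ X.T)                          # (width, batch)
        err = scale * a @ h - y                       # residual, (batch,)
        grad_a = scale * h @ err / batch
        grad_W = scale * ((a[:, None] * (1 - h**2)) * err) @ X / batch
        a -= lr * grad_a
        W -= lr * grad_W
    return top_sv(W) - s0

for n in (128, 512, 2048):
    print(f"width {n:5d}:  top-sv growth  muP {train(n, 'mup'):+.4f}"
          f"   NTK {train(n, 'ntk'):+.4f}")
```

The design point: under μP each neuron's weights move by an O(1) amount, so the normalized spectrum acquires a width-stable outlier, whereas under NTK per-neuron movement is O(1/sqrt(n)) and the outlier contribution vanishes as width grows, which is the lazy-training signature.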
Future work should extend this theory to capture bulk restructuring phenomena and validate predictions on modern architectures. Understanding these spectral dynamics could inform better initialization schemes, learning rate schedules, and width selection heuristics for practitioners.
- μP parameterization enables width-stable outlier dynamics and hyperparameter transfer, while NTK parameterization shows strong width-dependence despite converging asymptotically
- The edge-of-stability behavior of the leading NTK eigenvalue is consistent across network widths under μP but not under standard parameterization (the stability criterion is spelled out after this list)
- Simple tasks with few outputs follow bulk+outlier spectral dynamics, but large-output problems such as ImageNet undergo fundamental restructuring of the spectral bulk during training
- Dynamical mean-field theory successfully predicts outlier evolution as a function of training time, initialization variance, output scale, and network width
- The current theory breaks down for realistic high-output scenarios, indicating feature learning mechanisms beyond the bulk+outlier framework studied here
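For reference on the edge-of-stability bullet above: the underlying stability criterion is the classical one for gradient descent on a quadratic mode, a standard fact rather than a result of this paper. Writing lambda for the curvature (here, the leading NTK eigenvalue) and eta for the learning rate:

```latex
% Gradient descent theta_{t+1} = theta_t - eta * grad L contracts a
% quadratic mode of curvature lambda by a factor (1 - eta * lambda):
\theta_{t+1} - \theta_\star = (1 - \eta\lambda)\,(\theta_t - \theta_\star),
\qquad
\text{stable} \iff \eta\lambda < 2 .
```

The edge-of-stability observation (Cohen et al., 2021) is that the leading curvature rises during training until it hovers near the threshold lambda ≈ 2/eta; the claim summarized above is that, under μP, this behavior of the leading NTK mode looks the same at every width.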