One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
Researchers introduce Layerwise Learning Rate (LLR), an adaptive training technique that assigns different learning rates to individual Transformer layers based on Heavy-Tailed Self-Regularization theory. Testing across multiple LLM architectures and scales demonstrates up to 1.5x training speedup and improved generalization, with zero-shot accuracy improvements of 2-3% on billion-parameter models.
LLR addresses a fundamental inefficiency in modern LLM training: the assumption that all Transformer layers benefit from identical learning rates. By analyzing the empirical spectral density of weight correlation matrices, the method quantifies each layer's heavy-tailedness, a theoretical measure of stability and regularization strength. Layers exhibiting weaker heavy-tailedness receive larger learning rates to accelerate learning, while more heavy-tailed layers use smaller rates to prevent destabilization.
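To make the mechanism concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the Hill estimator, the choice of the upper half of the spectrum as the tail, the linear mapping, and the scale bounds `s_min` and `s_max` are all assumptions for illustration, and every function name is hypothetical. It assumes the power-law convention in which a smaller exponent alpha means a heavier tail, so layers with a larger alpha (weaker heavy-tailedness) are given a larger learning rate.

```python
import math

def hill_alpha(eigenvalues, k=None):
    """Hill estimate of the power-law exponent alpha of a spectrum's tail.

    `eigenvalues` are assumed to be the positive eigenvalues of W^T W for one
    layer's weight matrix W (computed elsewhere). Smaller alpha = heavier tail.
    """
    eigs = sorted(eigenvalues)
    n = len(eigs)
    if k is None:
        k = n // 2  # assumption: treat the upper half of the spectrum as the tail
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(eigenvalues)")
    threshold = eigs[n - k - 1]  # largest eigenvalue just below the tail
    log_sum = sum(math.log(x / threshold) for x in eigs[n - k:])
    return 1.0 + k / log_sum

def layerwise_lrs(alphas, base_lr, s_min=0.5, s_max=1.5):
    """Map per-layer alphas linearly onto [s_min, s_max] * base_lr.

    Hypothetical mapping: the layer with the largest alpha (weakest heavy
    tail) gets s_max * base_lr, the one with the smallest gets s_min * base_lr.
    """
    lo, hi = min(alphas), max(alphas)
    if hi == lo:
        return [base_lr for _ in alphas]  # no spread: fall back to a uniform rate
    return [base_lr * (s_min + (a - lo) / (hi - lo) * (s_max - s_min))
            for a in alphas]

def synthetic_spectrum(tail_index, n=1000):
    """Deterministic Pareto quantiles standing in for a layer's eigenvalues."""
    return [(1.0 - (i + 0.5) / n) ** (-1.0 / tail_index) for i in range(n)]

# Two toy "layers": the first has a heavier tail (smaller alpha) than the second.
alphas = [hill_alpha(synthetic_spectrum(a)) for a in (2.0, 4.0)]
lrs = layerwise_lrs(alphas, base_lr=1e-3)
print(alphas, lrs)
```

On these synthetic spectra the estimated exponents come out close to 3 and 5, so the heavier-tailed first layer receives the smaller rate and the second layer the larger one. In a real training run the eigenvalues would come from each layer's weight matrices, and the estimates would presumably be refreshed periodically as the weights change.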
This approach builds on established theoretical frameworks in deep learning that recognize structural heterogeneity within neural networks. Transformer architectures inherently contain diverse layer types with distinct roles in processing information, yet prevailing optimization practices ignore these differences. The research validates this intuition across multiple dimensions: different model sizes (60M to 3B parameters), architectures (LLaMA, GPT-nano), and optimizers (AdamW, Muon), demonstrating that the principle generalizes broadly.
The practical implications extend beyond training efficiency. A speedup of up to 1.5x reduces computational costs and environmental impact during model development. The accuracy improvements, particularly the 2-3% zero-shot gains on billion-parameter models, translate to more capable systems without architectural changes or additional parameters. The hyperparameter transfer aspect proves especially valuable: practitioners can derive nearly optimal layerwise settings from existing uniform-learning-rate experiments rather than conducting expensive hyperparameter searches.
The availability of an open-source implementation accelerates adoption. As LLM training becomes increasingly expensive and competitive, techniques that improve efficiency gain substantial leverage. Future work may explore whether these principles extend to other architectural components or multimodal models. The theoretical foundation in heavy-tail analysis also opens investigation into other spectral properties that might guide optimization decisions.
- Layerwise learning rates based on heavy-tail theory achieve up to 1.5x training speedup compared to uniform learning rate approaches.
- The method improves zero-shot accuracy by 2-3% on billion-parameter models without adding parameters.
- Nearly optimal layerwise settings transfer from uniform-learning-rate baseline experiments, avoiding expensive hyperparameter searches.
- The approach generalizes across multiple architectures, optimizers, and model scales from 60M to 3B parameters with 100B training tokens.
- An open-source implementation enables practical adoption in production LLM training pipelines.