AI · Bullish · arXiv – CS AI · 15h ago · 6/10
One LR Doesn't Fit All: Heavy-Tail Guided Layerwise Learning Rates for LLMs
Researchers introduce Layerwise Learning Rate (LLR), an adaptive training technique that assigns a separate learning rate to each Transformer layer, guided by Heavy-Tailed Self-Regularization theory. Tests across multiple LLM architectures and scales show up to a 1.5x training speedup and better generalization, with zero-shot accuracy gains of 2-3% on billion-parameter models.
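To make the idea of per-layer learning rates concrete, here is a minimal sketch in Python. It is not the paper's method: the summary does not give the actual rule, so the linear mapping, the function name `layerwise_lrs`, the `s_min`/`s_max` scale bounds, and the example heavy-tail exponents are all assumptions chosen for illustration. The sketch assumes each layer already has a heavy-tail exponent (alpha) estimated from its weight spectrum, and that layers with larger alpha get a larger learning rate.

```python
from typing import List


def layerwise_lrs(
    alphas: List[float],
    base_lr: float,
    s_min: float = 0.5,  # hypothetical lower bound on the LR scale
    s_max: float = 1.5,  # hypothetical upper bound on the LR scale
) -> List[float]:
    """Map per-layer heavy-tail exponents to per-layer learning rates.

    Illustrative assumption: the scale grows linearly with alpha, from
    s_min at the smallest alpha to s_max at the largest.
    """
    lo, hi = min(alphas), max(alphas)
    if hi == lo:
        # All layers look alike, so fall back to a single global LR.
        return [base_lr for _ in alphas]
    lrs = []
    for a in alphas:
        t = (a - lo) / (hi - lo)  # position of this layer in [0, 1]
        lrs.append(base_lr * (s_min + t * (s_max - s_min)))
    return lrs


# Made-up exponents for a three-layer model (not measured values).
alphas = [2.0, 3.0, 4.0]
lrs = layerwise_lrs(alphas, base_lr=1e-3)
# The three layers get roughly 0.5x, 1.0x, and 1.5x of base_lr.
```

In a real training loop, a list like `lrs` would typically be applied by giving each layer its own optimizer parameter group with its own learning rate, and recomputed periodically as the weight spectra change.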