Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View
Researchers propose Low-Rank Decay (LRD), a spectral regularization technique that improves generalization in scale-invariant Transformer architectures by compressing weight singular values after memorization. Unlike standard L2 decay, LRD remains effective in normalized models and accelerates grokking, the delayed generalization phenomenon, on algorithmic tasks.
This research addresses a limitation in training modern Transformers that use normalization mechanisms such as RMSNorm and Query-Key Normalization. These architectural choices make the network's output invariant to the scale of the affected weights. In that setting, traditional L2 regularization becomes ineffective once the model has memorized the training data: it shrinks weights only along the radial direction, which changes their norm but not the learned function. The authors identify this gap and propose Low-Rank Decay, a nuclear-norm-based regularizer that continues reshaping weight spectra even when task gradients vanish.
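The radial-versus-tangential distinction can be checked numerically. The following is a minimal NumPy sketch, not the paper's implementation: `rms_norm` is a simplified RMSNorm with no epsilon or learned gain, and `tangential_part` is a hypothetical helper that removes the component of a direction along `W`. It assumes `W` is full rank, so the nuclear norm is differentiable there with gradient `U @ Vt`.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((6, 4))
x = rng.standard_normal(4)

def rms_norm(z):
    """Simplified RMSNorm (assumption: no epsilon, no learned gain)."""
    return z / np.sqrt(np.mean(z ** 2))

# Scale invariance: rescaling W leaves the normalized output unchanged.
out = rms_norm(W @ x)
out_scaled = rms_norm((3.0 * W) @ x)

# L2 decay direction: the gradient of 0.5 * ||W||_F^2 is W itself.
l2_dir = W

# Nuclear-norm decay direction at a full-rank W = U S V^T is U V^T.
U, S, Vt = np.linalg.svd(W, full_matrices=False)
nuc_dir = U @ Vt

def tangential_part(G, W):
    """Remove the component of G along W (Frobenius inner product)."""
    coeff = np.sum(G * W) / np.sum(W * W)
    return G - coeff * W

# L2 decay has no tangential component; the nuclear-norm direction does.
l2_tangent_norm = np.linalg.norm(tangential_part(l2_dir, W))
nuc_tangent_norm = np.linalg.norm(tangential_part(nuc_dir, W))
print(l2_tangent_norm, nuc_tangent_norm)
```

Because the normalized output ignores the norm of `W`, only the tangential part of an update can change the function. In this toy setting the L2 direction has none, while the nuclear-norm direction does whenever the singular values are not all equal.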
The work builds on earlier regularization research while accounting for current architectural trends. Grokking, where models suddenly generalize long after memorizing the training set, remains poorly understood. The authors report that LRD induces rapid rank collapse in Query/Key matrices and extends the data-fraction boundary for grokking onset, and they support these empirical results with a spectral-geometric analysis. Their needle-to-fan interpretation of the nuclear-norm subdifferential near low-rank strata gives a geometric account of the optimization dynamics.
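For reference, the standard characterization of the nuclear-norm subdifferential is sketched below. The notation is chosen here for illustration and is not taken from the paper: $W \in \mathbb{R}^{m \times n}$ has rank $r$ and compact SVD $W = U \Sigma V^\top$, and $\|\cdot\|_{\mathrm{op}}$ is the operator norm. Reading the full-rank case as the "needle" and the rank-deficient case as the "fan" is an assumption about what the authors mean.

```latex
\partial \|W\|_* \;=\; \bigl\{\, U V^\top + Z \;:\; U^\top Z = 0,\quad Z V = 0,\quad \|Z\|_{\mathrm{op}} \le 1 \,\bigr\},
\qquad U \in \mathbb{R}^{m \times r},\; V \in \mathbb{R}^{n \times r}.
```

When $r = \min(m, n)$, the constraints force $Z = 0$, so the subdifferential is the single point $U V^\top$. When $r < \min(m, n)$, the admissible $Z$ fill a set of dimension $(m - r)(n - r)$, so a whole family of subgradient directions is available at a low-rank matrix.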
For practitioners, this suggests that the choice of regularizer matters in normalized architectures, contrary to the assumption in many training pipelines that L2 weight decay is a safe default. The method could improve sample efficiency and reduce computational cost by enabling generalization with less data. However, the evidence is limited to the small algorithmic tasks demonstrated in the paper; scaling to large language models or other domains remains unvalidated. The work emphasizes that architectural design choices can create optimization landscapes where classical techniques fail, which calls for regularizers designed for normalized layers.
- Low-Rank Decay outperforms L2 regularization in scale-invariant Transformers by maintaining tangential weight-space effects after memorization
- LRD induces effective-rank collapse in Query/Key matrices and significantly expands the data-fraction window for grokking emergence
- Standard weight decay acts only radially in normalized architectures and becomes ineffective, requiring alternative regularization approaches
- The spectral-geometric framework shows that nuclear-norm regularizers behave differently from Frobenius-norm decay in normalized regimes
- Results demonstrate improved sample efficiency on modular arithmetic tasks, though scaling to large models remains an open question
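The contrast between spectral and Frobenius-norm decay in the takeaways above can be illustrated with a toy NumPy sketch. It rests on several assumptions that the summary does not confirm: the nuclear-norm penalty is applied as a proximal step (soft-thresholding of singular values), effective rank is the exponential of the entropy of the normalized singular values, and there is no task gradient at all. The function names and the step size `tau` are hypothetical.

```python
import numpy as np

def effective_rank(W, eps=1e-12):
    """Entropy-based effective rank (assumed definition, not from the source)."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]  # drop zero singular values to avoid log(0)
    return float(np.exp(-np.sum(p * np.log(p))))

def nuclear_prox_step(W, tau):
    """Proximal step for tau * ||W||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)
    return (U * s_shrunk) @ Vt

def l2_decay_step(W, tau):
    """Decoupled L2 weight decay: uniform rescaling of W."""
    return (1.0 - tau) * W

rng = np.random.default_rng(0)
W0 = rng.standard_normal((16, 16))
W_nuc, W_l2 = W0.copy(), W0.copy()

# Apply only the regularizer, mimicking the regime where task gradients vanish.
for _ in range(50):
    W_nuc = nuclear_prox_step(W_nuc, tau=0.1)
    W_l2 = l2_decay_step(W_l2, tau=0.1)

print("initial:", effective_rank(W0))
print("after L2 decay:", effective_rank(W_l2))
print("after nuclear-norm steps:", effective_rank(W_nuc))
```

L2 decay rescales every singular value by the same factor, so the effective rank stays where it started. Soft-thresholding removes small singular values entirely, which is one simple mechanism that produces the kind of rank collapse the article describes.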