Pro-KLShampoo: Projected KL-Shampoo with Whitening Recovered by Orthogonalization
Researchers introduce Pro-KLShampoo, an improved optimizer for LLM pre-training that combines Kronecker-factored preconditioning with gradient orthogonalization. By exploiting the observed spike-and-flat eigenvalue structure in KL-Shampoo's preconditioners, Pro-KLShampoo achieves better validation loss, reduced memory usage, and faster training across multiple model scales.
Pro-KLShampoo represents a meaningful advance in optimizer design for large language model training, addressing a practical gap between two previously separate optimization approaches. The research identifies that KL-Shampoo's Kronecker preconditioners consistently exhibit a spike-and-flat eigenvalue pattern (a few dominant eigenvalues followed by a near-uniform tail) across different layers and training stages. This structural insight enables a hybrid approach that maintains full spectral resolution in a tracked r-dimensional subspace while applying orthogonalization to the remaining dimensions, mathematically recovering KL-Shampoo's full preconditioner form at lower cost.
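To make the structural idea concrete, here is a minimal NumPy sketch (not the authors' code) of a preconditioner inverse root that exploits a spike-and-flat spectrum. The function name `spike_flat_inverse_root`, the inverse-fourth-root exponent (borrowed from classic Shampoo), and the epsilon damping are illustrative assumptions; the point is that only the r tracked eigenpairs plus one tail scalar need to be stored.

```python
import numpy as np

def spike_flat_inverse_root(A, r, p=4, eps=1e-8):
    """Approximate A^{-1/p} assuming a spike-and-flat spectrum.

    Keeps the top-r eigenpairs (the "spikes") exactly and replaces the
    remaining eigenvalues (the "flat" tail) by their mean, so the result
    is low-rank-plus-scaled-identity: r vectors and one scalar.
    """
    # Full eigendecomposition for clarity; a practical implementation
    # would track the top-r subspace iteratively instead.
    eigvals, eigvecs = np.linalg.eigh(A)      # ascending eigenvalues
    top_vals = eigvals[-r:]                   # the r spikes
    U = eigvecs[:, -r:]                       # tracked r-dim subspace
    tail_mean = eigvals[:-r].mean() if A.shape[0] > r else eps

    # A^{-1/p} ~= U diag(top^{-1/p}) U^T + tail^{-1/p} (I - U U^T)
    spike_part = U @ np.diag((top_vals + eps) ** (-1.0 / p)) @ U.T
    flat_scale = (tail_mean + eps) ** (-1.0 / p)
    complement = np.eye(A.shape[0]) - U @ U.T
    return spike_part + flat_scale * complement
```

Because the tail is collapsed to a single scalar, the preconditioner acts isotropically outside the tracked subspace; rescaling all of those directions equally is, up to a scalar, what orthogonalizing the gradient component outside the tracked subspace achieves, which is consistent with the recovery claim described above.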
The work builds on growing momentum behind matrix-structured optimizers for LLM training, where methods such as KL-Shampoo and Muon have shown benefits for pre-training at scale. Rather than choosing between explicit Kronecker factorization and orthogonalization-based approaches, Pro-KLShampoo synthesizes both techniques, leveraging the empirically observed gradient structure to reduce computational overhead.
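For context on the orthogonalization side, the standard trick (popularized by Muon) avoids an explicit SVD by running a quintic Newton-Schulz iteration that drives all singular values toward 1. The sketch below uses the coefficients from the public Muon reference implementation; the helper name is ours, and whether Pro-KLShampoo uses this exact iteration is an assumption.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5):
    """Approximate the orthogonal factor U V^T of G = U S V^T.

    Quintic Newton-Schulz iteration with the coefficients used in the
    public Muon implementation; it pushes the singular values of the
    (Frobenius-normalized) input toward 1 without computing an SVD.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # ensures spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # iterate on the wide shape
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```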
For practitioners training large models, the results across GPT-2 and LLaMA variants demonstrate consistent improvements in three critical metrics: validation loss convergence, peak GPU memory consumption, and wall-clock training time. These gains compound significantly at production scales, where training costs dominate model development budgets. The ability to achieve better performance with reduced memory enables either faster training on existing hardware or training of larger models within fixed resource constraints.
Future work likely involves validating this approach across diverse architectures, investigating whether the spike-and-flat structure generalizes to other optimizer families, and exploring adaptive rank selection mechanisms that dynamically adjust r during training based on observed spectral properties.
- Pro-KLShampoo combines Kronecker-factored preconditioning with orthogonalization by exploiting spike-and-flat eigenvalue structures in gradient matrices
- The method demonstrates consistent improvements in validation loss, memory usage, and training time across GPT-2 and LLaMA models
- Hybrid approach bridges two previously isolated optimizer design paradigms through structural observation of gradient properties
- Practical gains compound at production scales where training costs represent major expenses in model development
- A mathematical argument shows that orthogonalization along the non-dominant eigenvalue directions recovers the full KL-Shampoo preconditioner form (a schematic version of this identity appears below)
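On the last point, the following is a schematic version of the recovery identity in our own notation, assuming an inverse-fourth-root preconditioner as in classic Shampoo (KL-Shampoo's exact exponent and normalization may differ):

```latex
% One Kronecker factor with r spikes and a flat tail of height \bar{\lambda}:
A = U_r \Lambda_r U_r^{\top} + \bar{\lambda}\,\bigl(I - U_r U_r^{\top}\bigr),
\qquad U_r^{\top} U_r = I_r .

% Because the two terms act on orthogonal subspaces, the inverse root splits:
A^{-1/4} = U_r \Lambda_r^{-1/4} U_r^{\top}
         + \bar{\lambda}^{-1/4}\,\bigl(I - U_r U_r^{\top}\bigr).

% Outside span(U_r) the preconditioner is \bar{\lambda}^{-1/4} I, an equal
% rescaling of every direction, which matches (up to scale) the effect of
% orthogonalizing the gradient component in that complement subspace.
```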