🧠 AI · 🟢 Bullish · Importance 7/10
Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs
arXiv – CS AI | Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao
🤖 AI Summary
Researchers propose Generalized Primal Averaging (GPA), a new optimization method that speeds up large language model training by roughly 8-10% over a standard AdamW baseline while using less memory than DiLoCo. GPA unifies and extends existing averaging-based optimizers such as DiLoCo by applying smooth iterate averaging at every step, without DiLoCo's complex two-loop structure; a sketch of the single-loop idea follows.
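To make the single-loop idea concrete, here is a minimal PyTorch sketch of per-step iterate averaging wrapped around AdamW. This is an illustration under assumptions, not the paper's exact GPA update: the class name, the `beta` coefficient, and the plain exponential average of the weights are all hypothetical, and GPA itself also prescribes how the averaged and current iterates interact during training, which this sketch omits.

```python
import torch

class AveragedAdamW:
    """Hypothetical sketch: single-loop iterate averaging around AdamW.

    After every base-optimizer step, a running average of the weights is
    blended toward the new iterate, so the averaged model is refreshed on
    every step instead of once per DiLoCo-style outer loop.
    """

    def __init__(self, params, beta=0.99, **adamw_kwargs):
        self.params = list(params)
        self.base = torch.optim.AdamW(self.params, **adamw_kwargs)
        self.beta = beta  # averaging coefficient (assumed hyperparameter)
        # z_0 = x_0: the running average starts at the initial weights
        self.avg = [p.detach().clone() for p in self.params]

    def zero_grad(self):
        self.base.zero_grad()

    def step(self):
        self.base.step()
        with torch.no_grad():
            for z, p in zip(self.avg, self.params):
                # z_{t+1} = beta * z_t + (1 - beta) * x_{t+1}
                z.mul_(self.beta).add_(p, alpha=1.0 - self.beta)
```

At evaluation time one would copy `self.avg` into the model. Note that this scheme needs only a single extra buffer per parameter, which is consistent with the takeaway below that dropping the two-loop structure reduces memory overhead.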
Key Takeaways
- GPA achieves 8.71%, 10.13%, and 9.58% speedups over the AdamW baseline for Llama-160M, 1B, and 8B models, respectively.
- The method reduces memory overhead compared to DiLoCo by eliminating the memory-intensive two-loop structure.
- GPA unifies recent averaging-based optimizers such as DiLoCo and Schedule-Free within a single framework.
- On ImageNet ViT workloads, GPA demonstrates speedups of 7% and 25.5% in small- and large-batch settings.
- Theoretical analysis proves GPA matches or exceeds the convergence guarantees of its base optimizers, with O(√T) regret (the standard notion is sketched after this list).
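For reference, here is the standard online-learning regret notion the last takeaway refers to; the paper's precise assumptions and constants are not reproduced here. Over $T$ rounds with losses $f_t$, the regret of iterates $x_t$ against the best fixed point is

$$\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} f_t(x_t) \;-\; \min_{x} \sum_{t=1}^{T} f_t(x) \;=\; \mathcal{O}\!\left(\sqrt{T}\right),$$

which is the classical optimal rate for convex online learning.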
Read Original → via arXiv – CS AI