
Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

arXiv – CS AI | Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao
🤖 AI Summary

Researchers propose Generalized Primal Averaging (GPA), an optimization method that speeds up large-language-model training by roughly 8–10% over standard AdamW while using less memory. GPA unifies and extends existing averaging-based optimizers such as DiLoCo by smoothly averaging iterates at every step, without DiLoCo's two-loop inner/outer structure.

Key Takeaways
  • GPA achieves 8.71%, 10.13%, and 9.58% speedups over AdamW baseline for Llama-160M, 1B, and 8B models respectively.
  • The method reduces memory overhead compared to DiLoCo by eliminating the memory-intensive two-loop structure.
  • GPA unifies recent averaging-based optimizers like DiLoCo and Schedule-Free within a single framework.
  • On ImageNet ViT workloads, GPA demonstrates speedups of 7% and 25.5% in small and large batch settings.
  • Theoretical analysis proves GPA matches or exceeds convergence guarantees of base optimizers with O(√T) regret.
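The core idea described above, replacing DiLoCo's inner/outer loops with a running average of iterates refreshed every step, can be illustrated with a minimal sketch. This is an assumption-laden toy (plain SGD as the base optimizer, a simple exponential moving average, and the names `gpa_sgd`, `grad_fn`, `beta` are illustrative), not the paper's exact GPA algorithm:

```python
def gpa_sgd(grad_fn, x0, lr=0.1, beta=0.9, steps=100):
    """Toy single-loop iterate averaging (illustrative, not the paper's exact GPA).

    A base optimizer (plain SGD here) updates the fast iterate x, while a
    smoothly averaged iterate y is refreshed on every step -- no separate
    outer loop or extra stored model copy as in DiLoCo.
    """
    x = x0  # fast iterate, updated by the base optimizer
    y = x0  # averaged iterate, returned as the final model
    for _ in range(steps):
        x = x - lr * grad_fn(x)          # base optimizer step
        y = beta * y + (1.0 - beta) * x  # smooth averaging at every step
    return y

# Example: minimize f(x) = x^2 (gradient 2x) starting from x0 = 1.0;
# the averaged iterate converges toward the minimizer at 0.
result = gpa_sgd(lambda x: 2.0 * x, 1.0)
```

Because the average is maintained incrementally, the only extra state beyond the base optimizer is the single averaged iterate `y`, which is the memory advantage the summary attributes to GPA over DiLoCo's two-loop scheme.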