🧠 AI | 🟢 Bullish | Importance 7/10

Smoothing DiLoCo with Primal Averaging for Faster Training of LLMs

arXiv – CS AI | Aaron Defazio, Konstantin Mishchenko, Parameswaran Raman, Hao-Jun Michael Shi, Lin Xiao
🤖 AI Summary

Researchers propose Generalized Primal Averaging (GPA), an optimization method that trains large language models 8-10% faster than a standard AdamW baseline while using less memory than DiLoCo. GPA unifies and extends existing averaging-based optimizers such as DiLoCo by averaging iterates smoothly at every step, with no two-loop structure.

Key Takeaways
  • GPA achieves speedups of 8.71%, 10.13%, and 9.58% over the AdamW baseline for Llama-160M, 1B, and 8B models, respectively.
  • The method reduces memory overhead compared to DiLoCo by eliminating the memory-intensive two-loop structure.
  • GPA unifies recent averaging-based optimizers such as DiLoCo and Schedule-Free within a single framework.
  • On ImageNet ViT workloads, GPA demonstrates speedups of 7% and 25.5% in the small-batch and large-batch settings, respectively.
  • Theoretical analysis proves that GPA matches or exceeds the convergence guarantees of its base optimizer, with O(√T) regret.
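
To make "averaging at every step, without a two-loop structure" concrete, here is a minimal toy sketch of single-loop iterate averaging. It is an illustration only, not the update rule from the paper: the function names, the coefficients `mu_y` and `mu_x`, the quadratic toy loss, and the use of plain gradient descent in place of AdamW are all assumptions made for this example.

```python
def grad(w, curvature, target):
    """Gradient of the toy loss sum_i 0.5 * curvature[i] * (w[i] - target[i])**2."""
    return [c * (wi - ti) for wi, c, ti in zip(w, curvature, target)]


def averaged_descent(curvature, target, steps=2000, lr=0.1, mu_y=0.9, mu_x=0.9):
    """Single-loop iterate averaging on a toy quadratic (hypothetical sketch).

    z is the base-optimizer iterate, x is its running average, and y is the
    interpolated point where the gradient is evaluated. All three are updated
    on every step, so there is no inner/outer loop.
    """
    z = [0.0] * len(target)  # base iterate (plain gradient descent here, not AdamW)
    x = [0.0] * len(target)  # smoothed average, the point one would evaluate
    for _ in range(steps):
        # Gradient is taken at an interpolation between the average and the base iterate.
        y = [mu_y * xi + (1.0 - mu_y) * zi for xi, zi in zip(x, z)]
        g = grad(y, curvature, target)
        # Base-optimizer step on z.
        z = [zi - lr * gi for zi, gi in zip(z, g)]
        # Exponential moving average of z, refreshed every step.
        x = [mu_x * xi + (1.0 - mu_x) * zi for xi, zi in zip(x, z)]
    return x, z


if __name__ == "__main__":
    x, z = averaged_descent(curvature=[1.0, 0.5], target=[1.0, 2.0])
    print("averaged iterate:", x)
    print("base iterate:", z)
```

In this sketch the only extra state beyond the base iterate is the single averaged buffer `x`, which is the general intuition behind avoiding a separate outer loop; how the actual method chooses its coefficients and base optimizer is described in the original paper.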
Read Original → via arXiv – CS AI