y0news
← Feed
←Back to feed
🧠 AI🟒 BullishImportance 6/10

From $\boldsymbol{\log\pi}$ to $\boldsymbol{\pi}$: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight

arXiv – CS AI | Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Chaowen Hu, Cong Qin, Zekai Shao, Binbin Zheng, Lu Pan, Ke Zeng
🤖 AI Summary

Researchers introduce Decoupled Gradient Policy Optimization (DGPO), a new reinforcement learning method that improves large language model training by using probability gradients instead of log-probability gradients. The technique addresses instability issues in current methods while maintaining exploration capabilities, showing superior performance across mathematical benchmarks.

Key Takeaways
  • DGPO uses probability gradients rather than log-probability gradients to solve training instability issues in large language models.
  • The method resolves the conflict between stability and exploration in reinforcement learning with verifiable rewards.
  • Testing across DeepSeek-R1-Distill-Qwen models (1.5B/7B/14B parameters) showed consistent outperformance on mathematical benchmarks.
  • Traditional hard clipping methods stifle exploration while soft clipping methods suffer from divergent weights as probabilities approach zero.
  • The decoupled decay mechanism uses asymmetric, continuous decay based on importance sampling ratios for boundary tokens.
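
The divergence mentioned in the takeaways follows from a standard calculus identity. As a sketch (the symbols $\pi_\theta$ for a token's probability and $w$ for a generic per-token weight are notation assumed here, not taken from the paper):

$$
\nabla_\theta \log \pi_\theta = \frac{\nabla_\theta \pi_\theta}{\pi_\theta}
\quad\Longrightarrow\quad
w \,\nabla_\theta \log \pi_\theta = \frac{w}{\pi_\theta}\,\nabla_\theta \pi_\theta .
$$

Any weight $w$ on the log-probability gradient that stays bounded away from zero therefore corresponds to a weight $w/\pi_\theta$ on the probability gradient, which grows without bound as $\pi_\theta \to 0$. This is presumably the divergence that DGPO targets by working with probability gradients directly. The summary does not give the paper's exact decay functions, so none are reproduced here.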