
From $\boldsymbol{\log\pi}$ to $\boldsymbol{\pi}$: Taming Divergence in Soft Clipping via Bilateral Decoupled Decay of Probability Gradient Weight

arXiv – CS AI | Xiaoliang Fu, Jiaye Lin, Yangyi Fang, Chaowen Hu, Cong Qin, Zekai Shao, Binbin Zheng, Lu Pan, Ke Zeng

🤖 AI Summary

Researchers introduce Decoupled Gradient Policy Optimization (DGPO), a new reinforcement learning method that improves large language model training by using probability gradients instead of log-probability gradients. The technique addresses instability issues in current methods while maintaining exploration capabilities, showing superior performance across mathematical benchmarks.
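The switch from log-probability to probability gradients rests on a standard identity (not spelled out in the summary, but implied by the title's $\log\pi \to \pi$ framing):

$$\nabla_\theta\,\pi_\theta(a \mid s) \;=\; \pi_\theta(a \mid s)\,\nabla_\theta \log \pi_\theta(a \mid s).$$

The probability gradient is the log-probability gradient scaled by $\pi_\theta$ itself, so its contribution vanishes smoothly as a token's probability approaches zero, whereas the implicit $1/\pi_\theta$ factor in the log-probability gradient is what drives soft-clipping weights toward divergence in that regime.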

Key Takeaways
  • DGPO uses probability gradients rather than log-probability gradients to solve training instability issues in large language models.
  • The method resolves the conflict between stability and exploration in reinforcement learning with verifiable rewards.
  • Testing across DeepSeek-R1-Distill-Qwen models (1.5B/7B/14B parameters) showed consistent outperformance on mathematical benchmarks.
  • Traditional hard-clipping methods stifle exploration, while soft-clipping methods suffer from divergent gradient weights as token probabilities approach zero.
  • The decoupled decay mechanism uses asymmetric, continuous decay based on importance sampling ratios for boundary tokens.
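The takeaways above describe the mechanism only in outline, so here is a minimal sketch of what an asymmetric, continuous decay of the gradient weight could look like. The function name, clip bounds, and exponential decay rates are all illustrative assumptions, not the paper's actual DGPO formula; the point is only that the weight stays continuous at the clip boundaries, decays at different rates on each side, and remains bounded as the importance ratio goes to zero.

```python
import numpy as np

def pg_weight(logp_new, logp_old, clip_low=0.8, clip_high=1.2,
              decay_low=10.0, decay_high=2.0):
    """Illustrative gradient weight (NOT the paper's exact formula).

    Inside [clip_low, clip_high] the weight is the plain importance
    sampling ratio, as in PPO. Outside the band it decays continuously
    and asymmetrically instead of being hard-zeroed (which stifles
    exploration) or left to diverge as probabilities shrink.
    """
    r = np.exp(logp_new - logp_old)  # importance sampling ratio
    return np.where(
        r < clip_low,
        # Below the band: fast exponential decay toward 0, matching
        # the value clip_low at the boundary so the weight is continuous.
        clip_low * np.exp(-decay_low * (clip_low - r)),
        np.where(
            r > clip_high,
            # Above the band: slower decay, preserving some exploration
            # pressure for tokens whose probability has grown.
            clip_high * np.exp(-decay_high * (r - clip_high)),
            r,  # inside the trust band: unmodified ratio
        ),
    )
```

A hard-clipped PPO weight would drop to a constant outside the band (killing the gradient signal entirely on one side), while a naive soft-clip weight built on $1/\pi$ blows up as the new probability approaches zero; the bilateral decay above is one bounded, continuous alternative in the spirit the takeaways describe.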