#ppo-optimization News & Analysis

2 articles tagged with #ppo-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

PR2: Predictive Routing Replay for MoE-Based LLM Reinforcement Learning

Researchers propose Predictive Routing Replay (PR2), a technique that stabilizes reinforcement learning on Mixture-of-Experts (MoE) LLMs by predicting how the router evolves and reducing the mismatch between the rollout and training phases. The method targets router drift, a critical source of instability when MoE models undergo RL fine-tuning, using lightweight prediction mechanisms that anticipate changes in expert activation.
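To make the rollout/training mismatch concrete, the sketch below measures how often the same experts are selected for a token under two sets of router logits. It is an illustration of the drift being described, not the PR2 method itself. The function names, the top-k routing assumption, and the overlap metric are all hypothetical choices made for this example.

```python
def top_k_experts(router_logits, k):
    """Return the indices of the k highest-scoring experts for one token."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return set(order[:k])


def routing_overlap(rollout_logits, train_logits, k=2):
    """Mean fraction of top-k experts shared between rollout and training.

    Both arguments are lists of per-token router logit lists.
    1.0 means identical routing; lower values indicate more drift.
    (Hypothetical metric, assumed for illustration only.)
    """
    overlaps = []
    for rollout_tok, train_tok in zip(rollout_logits, train_logits):
        shared = top_k_experts(rollout_tok, k) & top_k_experts(train_tok, k)
        overlaps.append(len(shared) / k)
    return sum(overlaps) / len(overlaps) if overlaps else 1.0


# One token, four experts: the router's second choice changed
# between the rollout pass and the training pass.
rollout = [[3.0, 2.0, 1.0, 0.0]]
train = [[3.0, 0.0, 1.0, 2.0]]
print(routing_overlap(rollout, train, k=2))
```

A value well below 1.0 means many tokens are being sent to different experts during training than during rollout, which is the kind of mismatch the summary refers to.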

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

Researchers propose CPPO (Cumulative Prefix-divergence Policy Optimization), a new reinforcement learning method that improves upon standard PPO approaches for LLM training by accounting for position-dependent effects and cumulative policy divergence. The method uses position-weighted thresholds and prefix budgets to better regulate token-level deviations during autoregressive generation, showing improved training stability and reasoning accuracy across model scales.
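As a rough illustration of how a position-dependent threshold and a prefix budget could modify token-level PPO clipping, here is a minimal sketch. The clipped-ratio surrogate is the standard PPO objective. The `decay` schedule for the threshold, the use of summed absolute log-ratios as the prefix divergence, and the hard cutoff once the budget is spent are assumptions made for this example and are not taken from the paper.

```python
import math


def clipped_token_objective(logp_new, logp_old, advantages,
                            base_eps=0.2, decay=0.0,
                            prefix_budget=float("inf")):
    """Mean PPO clipped surrogate over the tokens of one sequence.

    With decay=0 and an infinite budget this is plain token-level PPO.
    decay and prefix_budget are hypothetical knobs for illustration.
    """
    total, used, cum_div = 0.0, 0, 0.0
    for t, (ln, lo, adv) in enumerate(zip(logp_new, logp_old, advantages)):
        log_ratio = ln - lo
        cum_div += abs(log_ratio)          # assumed prefix-divergence proxy
        if cum_div > prefix_budget:
            break                          # assumed: drop tokens past the budget
        eps_t = base_eps / (1.0 + decay * t)   # assumed position schedule
        ratio = math.exp(log_ratio)
        clipped = min(max(ratio, 1.0 - eps_t), 1.0 + eps_t)
        total += min(ratio * adv, clipped * adv)
        used += 1
    return total / used if used else 0.0


# Uniform threshold vs. a threshold that tightens with position.
new, old, adv = [-1.0, 0.0], [-1.0, -1.0], [1.0, 1.0]
print(clipped_token_objective(new, old, adv))             # uniform eps
print(clipped_token_objective(new, old, adv, decay=1.0))  # tighter at t=1
```

Whether the threshold should tighten or loosen with position, and how the budget is enforced, are design choices this sketch does not settle; it only shows where such terms would enter the standard objective.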