y0news

#policy-optimization News & Analysis

31 articles tagged with #policy-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 15

Real-Time Aligned Reward Model beyond Semantics

Researchers introduce R2M (Real-Time Aligned Reward Model), a new framework for Reinforcement Learning from Human Feedback (RLHF) that addresses reward overoptimization in large language models. The system uses real-time policy feedback to better align reward models with evolving policy distributions during training.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 4

Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

Researchers have developed Hierarchy-of-Groups Policy Optimization (HGPO), a new reinforcement learning method that improves AI agents' performance on long-horizon tasks by addressing context inconsistency issues in stepwise advantage estimation. The method shows significant improvements over existing approaches when tested on challenging agentic tasks using Qwen2.5 models.
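The stepwise, hierarchical advantage estimation in HGPO is involved, but the group-relative normalization that this family of methods (e.g. GRPO) builds on can be sketched in a few lines. The function name and the flat, single-level grouping below are illustrative assumptions, not the paper's actual formulation:

```python
import statistics

def group_relative_advantages(rewards):
    """Normalize each rollout's reward against its group's mean and
    standard deviation, so advantages are relative to the group rather
    than to a learned value baseline."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero-variance groups
    return [(r - mean) / std for r in rewards]

# Four rollouts of the same task treated as one group.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

HGPO's contribution, per the summary, is organizing such groups hierarchically so that stepwise advantages stay consistent with the agent's long-horizon context.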

AI · Bullish · arXiv – CS AI · Feb 27 · 5/10 · 6

Improving Discrete Diffusion Unmasking Policies Beyond Explicit Reference Policies

Researchers developed a learned scheduler for masked diffusion models (MDMs) in language modeling that outperforms traditional rule-based approaches. The new method uses a KL-regularized Markov decision process framework and demonstrated significant improvements, including 20.1% gains over random scheduling and 11.2% over max-confidence approaches on benchmark tests.
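The KL-regularized MDP objective mentioned above trades expected reward against divergence from a reference policy. The sketch below assumes a categorical "which position to unmask next" policy and an illustrative beta; the paper's exact parameterization is not given in the summary:

```python
import math

def kl_regularized_objective(policy, reference, rewards, beta=0.1):
    """Expected reward under the learned unmasking policy, minus
    beta times KL(policy || reference) to keep it close to the
    reference (e.g. rule-based) scheduler."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0)
    return expected_reward - beta * kl
```

When the learned policy equals the reference, the KL term vanishes and the objective reduces to the expected reward; concentrating on high-reward positions raises reward but pays a KL cost.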

AI · Neutral · arXiv – CS AI · Feb 27 · 5/10 · 8

Soft Sequence Policy Optimization

Researchers introduce Soft Sequence Policy Optimization (SSPO), a new reinforcement learning method for training Large Language Models that improves upon existing policy optimization approaches. The technique uses soft gating functions and sequence-level importance sampling to enhance training stability and performance in mathematical reasoning tasks.
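The two ingredients named in the summary, sequence-level importance sampling and a soft gate in place of PPO's hard clip, can be sketched as follows. The length-normalized aggregation and the tanh gate are illustrative assumptions; the summary does not specify SSPO's exact forms:

```python
import math

def sequence_ratio(new_logps, old_logps):
    """Sequence-level importance ratio from per-token log-probs,
    aggregated as a length-normalized sum of token log-ratios."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))

def soft_clip(ratio, eps=0.2):
    """Smoothly squash the ratio into (1 - eps, 1 + eps); near 1 it is
    approximately the identity, unlike PPO's non-differentiable hard
    clip, which is the kind of stability benefit a soft gate targets."""
    return 1.0 + eps * math.tanh((ratio - 1.0) / eps)
```

An unchanged policy yields a ratio of exactly 1, which the gate passes through untouched; extreme ratios saturate smoothly instead of hitting a hard boundary.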

AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 4

Rethinking Policy Diversity in Ensemble Policy Gradient in Large-Scale Reinforcement Learning

Researchers propose Coupled Policy Optimization (CPO), a new reinforcement learning method that regulates policy diversity through KL constraints to improve exploration efficiency in large-scale parallel environments. The method outperforms existing baselines like PPO and SAPG across multiple tasks, demonstrating that controlled diverse exploration is key to stable and sample-efficient learning.
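Regulating ensemble diversity with KL constraints, as CPO is described to do, can be illustrated by penalizing the mean pairwise KL between members when it strays from a target level. The squared-deviation penalty and `target_kl` value are assumptions for the sketch, not the paper's loss:

```python
import math
from itertools import permutations

def categorical_kl(p, q):
    """KL(p || q) for two discrete action distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def diversity_penalty(policies, target_kl=0.05):
    """Squared deviation of the mean pairwise KL across ensemble
    members from a target diversity level: too little diversity and
    exploration collapses, too much and updates destabilize."""
    pairs = list(permutations(range(len(policies)), 2))
    avg_kl = sum(categorical_kl(policies[i], policies[j]) for i, j in pairs) / len(pairs)
    return (avg_kl - target_kl) ** 2
```

An ensemble of identical policies has zero pairwise KL and is penalized for under-exploring, matching the summary's claim that controlled diversity, not maximal diversity, drives sample efficiency.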

AI · Neutral · arXiv – CS AI · Mar 2 · 4/10 · 5

Beyond State-Wise Mirror Descent: Offline Policy Optimization with Parametric Policies

Researchers present theoretical advances in offline reinforcement learning that extend mirror-descent analyses, previously limited to state-wise updates, to parameterized policies over large or continuous action spaces. The work connects mirror descent to natural policy gradient methods and reveals a surprising unification between offline RL and imitation learning.

โ† PrevPage 2 of 2