#policy-gradient News & Analysis

23 articles tagged with #policy-gradient. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

OrderGrad: Optimizing Beyond the Mean with Order-Statistic Policy Gradient Estimation

OrderGrad introduces a family of gradient estimators that optimize order-statistic objectives rather than expected returns, enabling policy-gradient methods to directly target risk-sensitive metrics like Value-at-Risk, Conditional Value-at-Risk, and best-of-K outcomes. The method works as a plug-and-play reward transformation compatible with standard reinforcement learning algorithms, with applications demonstrated in LLM post-training and other domains.
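
As a rough illustration of the objectives named above (not OrderGrad's gradient estimator, which the summary does not specify), the sketch below computes empirical lower-tail Value-at-Risk, Conditional Value-at-Risk, and best-of-K from a batch of sampled returns. The function names and the quantile convention are assumptions made for this example.

```python
import math


def value_at_risk(returns, alpha):
    """Empirical lower-tail alpha-quantile of the sampled returns.

    Assumption: the 'lower' quantile convention, i.e. the
    ceil(alpha * n)-th smallest sample.
    """
    ordered = sorted(returns)
    index = max(0, math.ceil(alpha * len(ordered)) - 1)
    return ordered[index]


def conditional_value_at_risk(returns, alpha):
    """Mean of the worst ceil(alpha * n) sampled returns."""
    ordered = sorted(returns)
    tail_size = max(1, math.ceil(alpha * len(ordered)))
    tail = ordered[:tail_size]
    return sum(tail) / len(tail)


def best_of_k(returns):
    """Largest order statistic: the best of the K sampled returns."""
    return max(returns)


sampled_returns = [3.0, 1.0, 4.0, 2.0]
var_50 = value_at_risk(sampled_returns, alpha=0.5)
cvar_50 = conditional_value_at_risk(sampled_returns, alpha=0.5)
best = best_of_k(sampled_returns)
```

All three quantities depend only on the sorted samples, which is what makes them order-statistic objectives rather than averages over every return.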

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Proximal Supervised Fine-Tuning

Researchers propose Proximal Supervised Fine-Tuning (PSFT), a new method that applies trust-region constraints from reinforcement learning to improve how foundation models adapt to new tasks. The technique maintains model capabilities while fine-tuning, outperforming standard supervised fine-tuning on out-of-domain generalization tasks.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Reinforcement Learning for Diffusion LLMs with Entropy-Guided Step Selection and Stepwise Advantages

Researchers developed a new reinforcement learning approach for training diffusion language models that uses entropy-guided step selection and stepwise advantages to overcome challenges with sequence-level likelihood calculations. The method achieves state-of-the-art results on coding and logical reasoning benchmarks while being more computationally efficient than existing approaches.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

HALyPO: Heterogeneous-Agent Lyapunov Policy Optimization for Human-Robot Collaboration

Researchers developed HALyPO (Heterogeneous-Agent Lyapunov Policy Optimization), a new approach to improve stability in human-robot collaboration through multi-agent reinforcement learning. The method addresses the 'rationality gap' between human and robot learning by using Lyapunov stability conditions to prevent policy oscillations and divergence during training.

AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 3

Benefits and Pitfalls of Reinforcement Learning for Language Model Planning: A Theoretical Perspective

New research provides theoretical analysis of reinforcement learning's impact on Large Language Model planning capabilities, revealing that RL improves generalization through exploration while supervised fine-tuning may create spurious solutions. The study shows Q-learning maintains output diversity better than policy gradient methods, with findings validated on real-world planning benchmarks.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

A Formula-Driven Survey and Research Agenda for On-Policy Distillation

This arXiv paper presents a comprehensive taxonomy and research framework for on-policy distillation (OPD), a technique for training large language models using feedback from current or recent student policies. The work moves beyond single loss functions to analyze OPD as a systematic feedback-to-update problem, introducing new methods like Counterfactual Routed OPD (CR-OPD) and identifying critical mechanisms affecting model stability and performance.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

Outcome-Based RL Provably Leads Transformers to Reason, but Only With the Right Data

Researchers prove that Transformers trained with reinforcement learning and outcome-based rewards spontaneously develop chain-of-thought reasoning capabilities, but only when training data includes sufficient 'simple examples' requiring fewer reasoning steps. The findings bridge theory and practice, explaining how sparse reward signals drive emergence of interpretable algorithmic behavior in language models.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

Researchers introduce ReMax, a reinforcement learning objective that naturally induces exploration by evaluating policies over multiple samples, and develop RePPO, a PPO variant that achieves exploration without explicit bonus terms. The approach generalizes discrete retry counts to a continuous parameter, enabling fine-grained control of exploration in policy gradient methods.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

REAL: Regression-Aware Reinforcement Learning for LLM-as-a-Judge

Researchers introduce REAL, a reinforcement learning framework that optimizes LLMs used as automated evaluators by recognizing ordinal relationships in scoring tasks rather than treating outputs as binary outcomes. The method improves performance across model scales, with Pearson correlation gains of up to +8.40 on Qwen3-32B over supervised fine-tuning baselines.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Self-Play Reinforcement Learning under Imperfect Information in Big 2

Researchers develop a self-play reinforcement learning framework for Big 2, a four-player imperfect-information card game, demonstrating that PPO outperforms value-based methods under controlled conditions. The study reveals that entropy regularization and current-policy self-play improve agent performance, establishing Big 2 as a useful benchmark for testing deep RL in complex multi-agent environments with hidden information and variable action spaces.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Recurrent Structural Policy Gradient for Partially Observable Mean Field Games

Researchers introduce Recurrent Structural Policy Gradient (RSPG), an algorithmic advancement for solving Mean Field Games with partial observability by combining policy gradient methods with structural knowledge of system dynamics. The method achieves significantly faster convergence than model-free approaches while enabling history-aware behavior, accompanied by MFAX, a new JAX-based research framework for MFG implementations.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Researchers introduce ProRL, a reinforcement learning framework designed to improve proactive recommender systems that guide users toward target items through sequential recommendations. The approach addresses fundamental gradient estimation problems in policy learning by implementing stepwise reward centering and position-specific advantage estimation, demonstrating superior performance on real-world datasets.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems

Researchers identify critical failure modes in policy-gradient reinforcement learning methods when applied to long-horizon problems with cumulative damage, where short-term attractive actions lead to long-term negative outcomes. The study proposes a decomposition framework separating completion (reaching terminal horizon) from optimality (achieving dynamic-programming benchmarks) and validates predictions across two distinct domains: career planning and sports performance.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

Researchers propose Credit-Assigned Policy Gradient (CA-PG), a new machine learning technique that solves the variance problem in training early-stage rankers for two-stage retrieval systems. By computing gradients with respect to individual item selection probability rather than entire candidate sets, CA-PG enables scalable end-to-end training of search and recommendation systems.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

Researchers introduce SDPG, a visual reinforcement learning method that trains robotic control policies significantly faster and more efficiently on consumer GPUs. The approach reduces computational overhead through stochastic gradient estimation while maintaining superior performance, and includes new benchmarks for advancing visual robotics research.

🏢 Nvidia

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Bilevel Optimization over Saddle Points of Zero-Sum Markov Games

Researchers propose PANDA, a novel bilevel optimization algorithm for reinforcement learning that handles competitive multi-agent scenarios modeled as zero-sum Markov games. The method achieves state-of-the-art convergence rates without requiring second-order derivatives, advancing RL applications in incentive design and competitive environments.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

On the optimization dynamics of RLVR: Gradient gap and step size thresholds

Researchers provide theoretical foundations for Reinforcement Learning with Verifiable Rewards (RLVR), a technique for post-training large language models using binary feedback. The analysis introduces the 'Gradient Gap' concept to explain convergence dynamics and derives critical step-size thresholds that determine whether training succeeds or fails, with implications for practical implementations like length normalization.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning

Researchers introduce Slow-Fast Policy Optimization (SFPO), a new reinforcement learning framework that improves training stability and efficiency for large language model reasoning. SFPO outperforms existing methods like GRPO by up to 2.80 points on math benchmarks while requiring up to 4.93x fewer rollouts and 4.19x less training time.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 8

Reinforcement Learning for Control with Probabilistic Stability Guarantee: A Finite-Sample Approach

Researchers have developed L-REINFORCE, a reinforcement learning algorithm that provides probabilistic stability guarantees for control systems from finite data samples. The approach bridges reinforcement learning and control theory by extending the classical REINFORCE algorithm with Lyapunov stability methods, and is demonstrated in CartPole simulations.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 4

Distributions as Actions: A Unified Framework for Diverse Action Spaces

Researchers introduce a new reinforcement learning framework called Distributions-as-Actions (DA) that treats parameterized action distributions as actions, making all action spaces continuous regardless of original type. The approach includes a new policy gradient estimator (DA-PG) with lower variance and a practical actor-critic algorithm (DA-AC) that shows competitive performance across discrete, continuous, and hybrid control tasks.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 14

Trust Region Masking for Long-Horizon LLM Reinforcement Learning

Researchers propose Trust Region Masking (TRM) to address off-policy mismatch problems in Large Language Model reinforcement learning pipelines. The method provides the first non-vacuous monotonic improvement guarantees for long-horizon LLM-RL tasks by masking entire sequences that violate trust region constraints.
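
To make the idea of sequence-level masking concrete, here is a minimal sketch under stated assumptions. It is not the paper's algorithm: the per-token log-ratio test, the `max_log_ratio` threshold, and the dictionary layout of a rollout are all hypothetical choices for this example. A sequence is dropped entirely if any of its tokens drifts too far between the behavior policy and the current policy.

```python
import math


def keep_sequence(logp_new, logp_old, max_log_ratio=0.2):
    """Return True if every token stays inside the (assumed) trust region.

    logp_new / logp_old: per-token log-probabilities of the sampled tokens
    under the current policy and the behavior policy respectively.
    """
    return all(abs(n - o) <= max_log_ratio for n, o in zip(logp_new, logp_old))


def masked_surrogate(batch, max_log_ratio=0.2):
    """Importance-weighted surrogate averaged over the batch.

    Sequences that violate the threshold contribute zero; the mask is
    applied to the whole sequence, not to individual tokens.
    """
    total = 0.0
    for seq in batch:
        if not keep_sequence(seq["logp_new"], seq["logp_old"], max_log_ratio):
            continue  # mask the entire sequence
        ratios = [math.exp(n - o) for n, o in zip(seq["logp_new"], seq["logp_old"])]
        total += seq["advantage"] * sum(ratios)
    return total / len(batch)


batch = [
    # On-policy sequence: all ratios are exactly 1, so it is kept.
    {"logp_new": [-1.0, -2.0], "logp_old": [-1.0, -2.0], "advantage": 1.5},
    # One token has drifted by 0.5 in log space, so the whole sequence is masked.
    {"logp_new": [-1.0, -1.5], "logp_old": [-1.0, -2.0], "advantage": 4.0},
]
objective = masked_surrogate(batch)
```

Dividing by the full batch size rather than the number of kept sequences is another choice made here for simplicity; a real pipeline might normalize differently.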

AI · Neutral · OpenAI News · Mar 20 · 3/10 · 5

Variance reduction for policy gradient with action-dependent factorized baselines

This appears to be a research paper on policy gradient methods in reinforcement learning, specifically focusing on variance reduction techniques using action-dependent factorized baselines. The article lacks content details, making it difficult to assess specific findings or implications.

AI · Neutral · Hugging Face Blog · Jun 30 · 1/10 · 3

Policy Gradient with PyTorch

The article appears to be about implementing policy gradient algorithms using the PyTorch framework. However, the article body is empty, making it impossible to provide meaningful analysis of the content or its implications.