VESPO: Variational Sequence-Level Soft Policy Optimization for Stable Off-Policy LLM Training
Researchers introduce VESPO, a reinforcement-learning method for training large language models that addresses the high variance of importance weights in off-policy updates. Derived from a variational objective, the technique weights whole sequences rather than individual tokens, enabling stable training even as rollout data grows stale, with demonstrated improvements on math and code-generation tasks.
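The summary above does not spell out VESPO's exact objective, but the core idea it names (weighting sequences rather than tokens) can be illustrated with a minimal sketch. The sketch below is an assumption-laden illustration, not VESPO itself: it computes a single importance weight per sequence by summing per-token log-probability differences between the current and behavior policies, which is equivalent to multiplying the per-token ratios.

```python
import math

def sequence_importance_weight(logp_new, logp_old):
    """Hypothetical sequence-level importance weight (not VESPO's exact form).

    logp_new, logp_old: per-token log-probabilities of the same sampled
    sequence under the current policy and the (stale) behavior policy.
    Summing log-prob differences and exponentiating yields one weight
    per sequence, rather than one ratio per token.
    """
    return math.exp(sum(n - o for n, o in zip(logp_new, logp_old)))

def token_importance_weights(logp_new, logp_old):
    """Per-token ratios, for contrast with the sequence-level weight."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

# The sequence weight equals the product of the token-level ratios.
lp_new = [-0.5, -1.2, -0.8]
lp_old = [-0.6, -1.0, -0.9]
seq_w = sequence_importance_weight(lp_new, lp_old)
prod_w = math.prod(token_importance_weights(lp_new, lp_old))
```

Because these products can explode as the behavior policy grows stale, methods in this family typically regularize or softly clip the sequence weight; the specific "soft" mechanism VESPO uses is not described in the summary above.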