35 articles tagged with #policy-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI Bullish · arXiv · CS AI · 2d ago · 7/10
🧠 Researchers introduce Inverse-RPO, a methodology for deriving prior-based tree policies in Monte Carlo Tree Search from first principles, and apply it to create variance-aware UCT algorithms that outperform PUCT without additional computational overhead. This advances the theoretical foundation of MCTS used in reinforcement learning systems like AlphaZero.
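The variance-aware UCT idea above can be sketched as a UCB1-Tuned-style selection rule, where the exploration bonus shrinks for arms whose reward variance is already known to be low. This is an illustrative stand-in; the paper's Inverse-RPO-derived rule may differ in the exact bonus term.

```python
import math

def variance_aware_uct(node_stats, c=1.0):
    """Pick the index of the child maximizing a variance-aware UCT score.

    node_stats: list of (visit_count, mean_reward, reward_variance) tuples.
    Hypothetical sketch in the spirit of UCB1-Tuned, not the paper's rule.
    """
    total = sum(n for n, _, _ in node_stats)
    log_t = math.log(max(total, 1))

    def score(stats):
        n, mean, var = stats
        if n == 0:
            return float("inf")  # always expand unvisited children first
        # Variance-aware exploration bonus: cap the per-arm variance term
        # at 1/4 (the UCB1 worst case for bounded rewards).
        bonus = math.sqrt((log_t / n) * min(0.25, var + math.sqrt(2 * log_t / n)))
        return mean + c * bonus

    return max(range(len(node_stats)), key=lambda i: score(node_stats[i]))
```

An unvisited child always wins selection, so the tree is fully expanded before the variance term starts to matter.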
AI Bullish · arXiv · CS AI · 3d ago · 7/10
🧠 Researchers introduce SafeAdapt, a novel framework for updating reinforcement learning policies while maintaining provable safety guarantees across changing environments. The approach uses a 'Rashomon set' to identify safe parameter regions and projects policy updates onto this certified space, addressing the critical challenge of deploying RL agents in safety-critical applications where dynamics and objectives evolve over time.
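The core mechanic — take an unconstrained policy update, then project the result back into a certified-safe parameter region — can be sketched with a simple axis-aligned box standing in for the Rashomon set. The real certified region in the paper is more complex; all names here are illustrative.

```python
def project_update(theta, grad, lr, lower, upper):
    """Gradient step followed by Euclidean projection onto a safe box.

    theta, grad: lists of parameter values and gradients.
    lower, upper: scalar bounds of the (toy) certified safe region.
    """
    stepped = [t + lr * g for t, g in zip(theta, grad)]    # unconstrained update
    return [min(max(x, lower), upper) for x in stepped]    # project onto the box
```

For a box region the projection is just coordinate-wise clipping; a genuine Rashomon set would need a more general constrained-optimization step.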
AI Bullish · arXiv · CS AI · Apr 6 · 7/10
🧠 Researchers propose Sign-Certified Policy Optimization (SignCert-PO) to address reward hacking in reinforcement learning from human feedback (RLHF), a critical problem where AI models exploit learned reward systems rather than improving actual performance. The lightweight approach down-weights non-robust responses during policy optimization and showed improved win rates on summarization and instruction-following benchmarks.
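One plausible reading of "sign-certified" down-weighting is to trust a response only when an ensemble of reward models agrees on the sign of its reward. The sketch below is a hypothetical interpretation, not the paper's actual certification procedure.

```python
def sign_certified_weights(ensemble_rewards):
    """Weight each response by how unanimously an ensemble of reward
    models agrees on the sign of its reward.

    ensemble_rewards: per-model reward lists, shape (models, responses).
    Returns a weight in [0, 1] per response: 1.0 = unanimous sign,
    0.0 = evenly split, so sign-unstable responses are down-weighted.
    """
    n_models = len(ensemble_rewards)
    n_resp = len(ensemble_rewards[0])
    weights = []
    for j in range(n_resp):
        signs = [1 if ensemble_rewards[i][j] > 0 else -1 for i in range(n_models)]
        weights.append(abs(sum(signs)) / n_models)
    return weights
```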
AI Bullish · arXiv · CS AI · Mar 16 · 7/10
🧠 Researchers introduce Guided Policy Optimization (GPO), a new reinforcement learning framework that addresses challenges in partially observable environments by co-training a guider with privileged information and a learner through imitation learning. The method demonstrates theoretical optimality comparable to direct RL and shows strong empirical performance across various tasks including continuous control and memory-based challenges.
AI Bullish · arXiv · CS AI · Mar 11 · 7/10
🧠 Researchers introduce Stepwise Guided Policy Optimization (SGPO), a new framework that improves upon Group Relative Policy Optimization (GRPO) by learning from incorrect reasoning responses in large language model training. SGPO addresses the limitation where GRPO fails to update policies when all responses in a group are incorrect, showing improved performance across multiple model sizes and reasoning benchmarks.
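The GRPO failure mode the summary describes is easy to see in the group-normalized advantage itself: when every response in a group gets the same (zero) reward, every advantage is zero and no gradient flows. A minimal sketch:

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: reward minus group mean,
    divided by group std. If all responses in the group are wrong
    (identical rewards), every advantage is exactly zero — the
    'no-signal' case that SGPO's stepwise guidance targets."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

With rewards `[0, 0, 0, 0]` the deviations are all zero before normalization, so the policy update vanishes regardless of how informative the individual (wrong) responses were.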
AI Bullish · arXiv · CS AI · Mar 5 · 6/10
🧠 GIPO (Gaussian Importance Sampling Policy Optimization) is a new reinforcement learning method that improves data efficiency for training multimodal AI agents. The approach uses Gaussian trust weights instead of hard clipping to better handle scarce or outdated training data, showing superior performance and stability across various experimental conditions.
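The contrast with hard clipping can be sketched directly: instead of truncating the importance ratio at fixed bounds (PPO-style), a Gaussian trust weight centred at ratio = 1 lets stale off-policy samples fade out smoothly. The `sigma` value is an assumed hyperparameter, not one from the paper.

```python
import math

def gaussian_trust_weight(ratio, sigma=0.3):
    """Smooth alternative to hard ratio clipping: weight a sample by a
    Gaussian in its importance ratio, peaking at 1.0 for on-policy data
    and decaying continuously as the data becomes more off-policy."""
    return math.exp(-((ratio - 1.0) ** 2) / (2 * sigma ** 2))
```

A sample with ratio 1.0 keeps full weight; a badly off-policy sample with ratio 2.0 is effectively discarded rather than clipped to a hard boundary.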
AI Bullish · arXiv · CS AI · Mar 4 · 6/10
🧠 Researchers introduce RAPO (Retrieval-Augmented Policy Optimization), a new reinforcement learning framework that improves LLM agent training by incorporating retrieval mechanisms for broader exploration. The method achieves 5% performance gains across 14 datasets and 1.2x faster training by using hybrid-policy rollouts and retrieval-aware optimization.
AI Bullish · arXiv · CS AI · Mar 4 · 6/10
🧠 Researchers propose NAR-CP, a new method to improve Large Language Models' performance in high-frequency decision-making tasks like UAV pursuit. The approach uses normalized action rewards and consistency policy optimization to address limitations in current LLM-based agents that struggle with rapid, precise numerical state updates.
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 MIT researchers introduce VCPO (Variance Controlled Policy Optimization), a new method that improves asynchronous reinforcement learning for LLM training by addressing high variance issues in off-policy settings. The technique dynamically scales learning rates and applies variance control to achieve stable training with 2.5x speedup while maintaining performance.
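The "dynamically scales learning rates" part can be illustrated with a minimal controller that shrinks the step size whenever measured gradient variance exceeds a target. VCPO's actual controller is more sophisticated; this is only a sketch of the principle.

```python
def scaled_lr(base_lr, grad_variance, target_variance=1.0):
    """Variance-controlled learning-rate scaling: keep the base rate while
    gradient variance is within budget, and shrink it proportionally once
    off-policy variance spikes, trading step size for stability."""
    if grad_variance <= target_variance:
        return base_lr
    return base_lr * target_variance / grad_variance
```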
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers introduced Scaf-GRPO, a new training framework that overcomes the 'learning cliff' problem in LLM reasoning by providing strategic hints when models plateau. The method boosted Qwen2.5-Math-7B performance on the AIME24 benchmark by 44.3% relative to baseline GRPO methods.
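The scaffolding idea can be sketched as a prompt-level intervention: only when every sampled response to a problem is wrong (the "learning cliff", where group-relative advantages vanish) is a hint injected before resampling. The tiered-hint interface below is illustrative; Scaf-GRPO's hint schedule may differ.

```python
def scaffolded_prompt(problem, responses_correct, hints, level=0):
    """Return the training prompt for the next sampling round.

    If any response was correct, the group already carries learning
    signal and the problem is left untouched; otherwise a hint from
    the (hypothetical) tiered hint list is appended."""
    if any(responses_correct):
        return problem                      # signal exists; no hint needed
    level = min(level, len(hints) - 1)      # strongest hint as a fallback
    return f"{problem}\nHint: {hints[level]}"
```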
AI Neutral · arXiv · CS AI · 2d ago · 6/10
🧠 Researchers propose Policy Split, a novel reinforcement learning approach for LLMs that uses dual-mode entropy regularization to balance exploration with task accuracy. By bifurcating the policy into normal and high-entropy modes, the method enables diverse behavioral patterns while maintaining performance, showing improvements over existing entropy-guided RL baselines.
AI Neutral · arXiv · CS AI · 2d ago · 6/10
🧠 Researchers present a theoretical framework comparing entropy control methods in reinforcement learning for LLMs, showing that covariance-based regularization outperforms traditional entropy regularization by avoiding policy bias and achieving asymptotic unbiasedness. This analysis addresses a critical scaling challenge in RL-based LLM training where rapid policy entropy collapse limits model performance.
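The quantity at the heart of covariance-based entropy control is the covariance between a token's log-probability and its advantage: when that covariance is large and positive, the policy is sharpening fastest exactly where advantages are high, which drives entropy collapse. A minimal illustrative computation:

```python
def cov_logprob_advantage(logprobs, advantages):
    """Sample covariance (population-normalized) between per-token
    log-probabilities and advantages — the statistic that covariance-based
    regularizers penalize in place of a raw entropy bonus. Sketch only;
    the paper's estimator and normalization may differ."""
    n = len(logprobs)
    mean_l = sum(logprobs) / n
    mean_a = sum(advantages) / n
    return sum((l - mean_l) * (a - mean_a)
               for l, a in zip(logprobs, advantages)) / n
```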
AI Bullish · arXiv · CS AI · 2d ago · 6/10
🧠 Researchers introduce PODS (Policy Optimization with Down-Sampling), a technique that accelerates reinforcement learning training for large language models by selectively training on high-variance rollouts rather than all generated data. The method achieves equivalent performance to standard approaches at 1.7x faster speeds, addressing computational bottlenecks in LLM reasoning optimization.
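One way to "selectively train on high-variance rollouts" is to keep only the reward extremes of each batch, which maximizes reward variance within the retained set. This sketch captures the idea; the paper's selection rule may differ in detail.

```python
def downsample_rollouts(rollouts, rewards, keep=4):
    """Retain the rollouts at the extremes of the reward range: half the
    budget from the lowest-reward rollouts, half from the highest, so
    the kept subset has maximal reward variance (hence strong gradient
    signal) while the bulk of mid-reward rollouts is dropped."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    half = keep // 2
    kept = order[:half] + order[-half:]        # both ends of the reward range
    return [rollouts[i] for i in sorted(kept)] # preserve original batch order
```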
AI Neutral · arXiv · CS AI · 3d ago · 6/10
🧠 Researchers propose StaRPO, a reinforcement learning framework that improves large language model reasoning by incorporating stability metrics alongside task rewards. The method uses Autocorrelation Function and Path Efficiency measurements to evaluate logical coherence and goal-directedness, demonstrating improved accuracy and reasoning consistency across four benchmarks.
AI Neutral · arXiv · CS AI · 3d ago · 6/10
🧠 Researchers propose Visually-Guided Policy Optimization (VGPO), a framework that enhances vision-language models' ability to focus on visual information during reasoning tasks. The method addresses a fundamental limitation where text-dominated VLMs suffer from weak visual attention and temporal visual forgetting, improving performance on multimodal reasoning and visual-dependent tasks.
AI Neutral · arXiv · CS AI · 6d ago · 6/10
🧠 Researchers propose T-STAR, a novel reinforcement learning framework that structures multi-step agent trajectories as trees rather than independent chains, enabling better credit assignment for LLM agents. The method uses tree-based reward propagation and surgical policy optimization to improve reasoning performance across embodied, interactive, and planning tasks.
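The credit-assignment benefit of a tree over independent chains comes from reward propagation: a shared prefix inherits value from every branch it leads to, instead of being scored once per chain. A minimal sketch (mean-over-children backup; T-STAR's exact propagation rule may differ):

```python
def propagate_rewards(tree, leaf_rewards, gamma=1.0):
    """Build a value function over a trajectory tree.

    tree: dict mapping internal node -> list of child nodes.
    leaf_rewards: dict mapping leaf node -> terminal reward.
    An internal node's value is the discounted mean of its children,
    so shared prefixes aggregate credit from all their continuations."""
    def value(node):
        if node in leaf_rewards:
            return leaf_rewards[node]
        children = tree[node]
        return gamma * sum(value(c) for c in children) / len(children)
    return value
```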
AI Bullish · arXiv · CS AI · Mar 17 · 6/10
🧠 Researchers propose MA-VLCM, a framework that uses pretrained vision-language models as centralized critics in multi-agent reinforcement learning instead of learning critics from scratch. This approach significantly improves sample efficiency and enables zero-shot generalization while producing compact policies suitable for resource-constrained robots.
AI Bullish · arXiv · CS AI · Mar 16 · 6/10
🧠 Researchers introduce CRAFT-GUI, a curriculum learning framework that uses reinforcement learning to improve AI agents' performance in graphical user interface tasks. The method addresses difficulty variation across GUI tasks and provides more nuanced feedback, achieving 5.6% improvement on Android Control benchmarks and 10.3% on internal benchmarks.
AI Bullish · arXiv · CS AI · Mar 12 · 6/10
🧠 Researchers introduce CLIPO (Contrastive Learning in Policy Optimization), a new method that improves upon Reinforcement Learning with Verifiable Rewards (RLVR) for training Large Language Models. CLIPO addresses hallucination and answer-copying issues by incorporating contrastive learning to better capture correct reasoning patterns across multiple solution paths.
AI Bullish · arXiv · CS AI · Mar 6 · 6/10
🧠 Researchers propose EvoTool, a new framework that optimizes AI agent tool-use policies through evolutionary algorithms rather than traditional gradient-based methods. The system decomposes agent policies into four modules and uses blame attribution and targeted mutations to improve performance, showing over 5-point improvements on benchmarks.
AI Bullish · arXiv · CS AI · Mar 3 · 6/10
🧠 Researchers introduce InfoPO (Information-Driven Policy Optimization), a new method that improves AI agent interactions by using information-gain rewards to identify valuable conversation turns. The approach addresses credit assignment problems in multi-turn interactions and outperforms existing baselines across diverse tasks including intent clarification and collaborative coding.
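An information-gain reward for a conversation turn can be sketched as the entropy reduction in the agent's belief over user intent before versus after the turn. The belief distributions here are illustrative; the paper's reward construction may differ.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def info_gain_reward(belief_before, belief_after):
    """Reward a turn by how much it reduced uncertainty over intent:
    entropy(before) - entropy(after). A turn that fully resolves a
    50/50 ambiguity earns ln(2) nats; an uninformative turn earns 0."""
    return entropy(belief_before) - entropy(belief_after)
```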
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers propose MemPO (Self-Memory Policy Optimization), a new algorithm that enables AI agents to autonomously manage their memory during long-horizon tasks. The method achieves 25.98% F1 score gains over base models while reducing token usage by 67.58%.
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers introduce HiMAC, a hierarchical reinforcement learning framework that improves LLM agent performance on long-horizon tasks by separating macro-level planning from micro-level execution. The approach demonstrates state-of-the-art results across multiple environments, showing that structured hierarchy is more effective than simply scaling model size for complex agent tasks.
AI Bullish · arXiv · CS AI · Mar 3 · 6/10
🧠 FlowPortrait is a new reinforcement learning framework that uses Multimodal Large Language Models for evaluation to generate more realistic talking-head videos with better lip synchronization. The system combines human-aligned assessment with policy optimization techniques to address persistent issues in audio-driven portrait animation.
AI Bullish · arXiv · CS AI · Mar 3 · 6/10
🧠 Researchers introduce In-Context Policy Optimization (ICPO), a new method that allows AI models to improve their responses during inference through multi-round self-reflection without parameter updates. The practical ME-ICPO algorithm demonstrates competitive performance on mathematical reasoning tasks while maintaining affordable inference costs.