#on-policy-learning News & Analysis

9 articles tagged with #on-policy-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Researchers introduce On-Policy Diffusion Language Models (OPDLM), a technique that converts autoregressive language models into diffusion models using 15-7,000x fewer training tokens. The method addresses fundamental efficiency problems by eliminating train-inference mismatches and preserving knowledge from the original model through on-policy distillation.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization

Researchers introduce VULPO, an on-policy LLM optimization framework for vulnerability detection that achieves a 203% improvement over baseline models by incorporating context-aware reasoning and multidimensional reward signals. The approach combines a new ContextVul dataset with specialized fine-tuning to produce security analysis tools that reason through complex code interactions.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Less is More: Early Stopping Rollout for On-Policy Distillation

Researchers propose Early Stopping Rollout (ESR), a novel distillation technique that improves on-policy student model training by limiting rollout generation to initial response tokens. The method addresses "Off-policy Teacher Decay," where teachers lose effectiveness on later tokens, achieving better performance with higher GPU efficiency than standard approaches.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Rubric-based On-policy Distillation

Researchers introduce ROPD, a rubric-based on-policy distillation framework that replaces teacher logits with structured semantic rubrics for model alignment. The approach achieves up to 10x better sample efficiency than logit-based methods while enabling distillation from proprietary black-box LLMs, addressing a critical scalability limitation in current model training.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

When Context Returns: Toward Robust Internalization in On-Policy Distillation

Researchers identify a critical failure mode in on-policy distillation where reintroducing privileged context (like system prompts) to a distilled student model degrades performance, even on previously solved tasks. They propose a lightweight consistency regularizer using stop-gradient anchoring and forward KL divergence to achieve 'context removability,' enabling models to internalize context while remaining stable when it reappears.
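As a rough illustration of the forward KL term mentioned in the summary, the sketch below computes KL(anchor ‖ student) for two categorical next-token distributions. The function name `forward_kl`, the `eps` floor, and the choice of which distribution plays the anchor are assumptions made for illustration; the summary does not specify them. In an autodiff framework the anchor would be wrapped in a stop-gradient so that only the student side is updated; plain floats carry no gradient, so that step appears here only as a comment.

```python
import math


def forward_kl(anchor_probs, student_probs, eps=1e-12):
    """Forward KL divergence KL(anchor || student) over one token distribution.

    anchor_probs is treated as a fixed target (the role a stop-gradient
    would enforce in an autodiff framework); student_probs is the
    distribution being pulled toward it. Both are hypothetical inputs.
    """
    if len(anchor_probs) != len(student_probs):
        raise ValueError("distributions must cover the same vocabulary")
    total = 0.0
    for p, q in zip(anchor_probs, student_probs):
        if p > 0.0:  # terms with p == 0 contribute nothing
            # Floor q to avoid log(0) when the student assigns zero mass.
            total += p * (math.log(p) - math.log(max(q, eps)))
    return total
```

Forward KL is asymmetric: it penalizes the student most where the anchor places probability mass that the student does not cover, which is why swapping the two arguments generally gives a different value.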

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Beyond Absolute Imitation: Anchored Residual Guidance for Privileged On-Policy Distillation

Researchers introduce Anchored Residual On-Policy Distillation (AR-OPD), a new framework for training smaller language models that improves upon existing privileged distillation methods by separating locally reachable reasoning from oracle guidance. The approach achieves 2.3-point gains over full privileged distillation and 7.9-point gains over standard supervised fine-tuning, with significant improvements on long-horizon reasoning tasks.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Trajectory-Refined Distillation

Researchers propose Trajectory-Refined Distillation (TRD), a training method that addresses structural failures in on-policy distillation for large language models by correcting problematic rollouts at the trajectory level rather than the token level. TRD shows consistent improvements across benchmarks by mitigating prefix failure and exposing models to alternative valid reasoning paths during training.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

Researchers introduce ViCuR, a visually grounded distillation framework that improves multimodal AI reasoning by using recoverable visual cues instead of answer-dependent privileges. The approach achieves consistent performance gains across seven benchmarks with Qwen3-VL models by eliminating train-test mismatches that encourage shortcut learning rather than genuine visual understanding.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Not All Transitions Matter: Evidence from PPO

Researchers propose a simple technique for stabilizing PPO training: randomly dropping 25% of the transitions collected during rollouts. The method removes gradient redundancy caused by causally dependent state sequences, improving training consistency across multiple environments without modifying the algorithm itself.
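The transition-dropping idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the helper name `drop_transitions`, the choice to drop an exact fraction rather than sample each transition independently, and the assumption that returns and advantages were already computed on the full rollout before dropping are all hypothetical.

```python
import random


def drop_transitions(transitions, drop_fraction=0.25, rng=None):
    """Return a random subset of a rollout, keeping the original order.

    transitions: list of per-step records (for example, tuples of
    state, action, log-prob, advantage), assumed to be fully computed
    before any are discarded.
    drop_fraction: share of transitions to discard (0.25 per the summary).
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    rng = rng or random.Random()
    n_drop = int(len(transitions) * drop_fraction)
    n_keep = len(transitions) - n_drop
    # Sample indices without replacement, then restore temporal order.
    keep_idx = sorted(rng.sample(range(len(transitions)), n_keep))
    return [transitions[i] for i in keep_idx]
```

A PPO loop would call this once per rollout and run its usual minibatch updates on the returned subset; passing a seeded `random.Random` makes the subset reproducible.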