#rlvr News & Analysis

34 articles tagged with #rlvr. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 5d ago · 7/10

Beyond Reasoning Gains: Mitigating General-Capability Forgetting in Large Reasoning Models

Researchers propose RECAP, a dynamic reweighting strategy that preserves general AI capabilities while improving reasoning performance in large language models trained with reinforcement learning. The method addresses a critical problem where models forget foundational skills like perception and faithfulness during post-training optimization on reasoning tasks.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

sGPO: Trading Inference FLOPs for Training Efficiency in RLVR

Researchers introduce sGPO (sorted Group Policy Optimization), a training method that reduces computational waste in reinforcement learning by using cheap inference to profile query difficulty and dynamically allocate training resources. The approach achieves a 3x reduction in total training compute while maintaining or improving performance.

AI · Neutral · arXiv – CS AI · Jun 5 · 7/10

A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

Researchers present a pre-registered causal decomposition framework that reveals how reinforcement learning from verifiable rewards (RLVR) conflates self-consistency elicitation with genuine reward-design effects. Through controlled experiments, they demonstrate that naive performance metrics systematically overestimate reward-design impact by 50-95%, with elicitation dominating in weak-prior regimes. The work provides diagnostic tools to audit published alignment research and expose methodological confounds.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

SUPERNOVA introduces a framework for extending reinforcement learning with verifiable rewards (RLVR) beyond STEM fields by systematically curating data from natural instruction datasets. Smaller models trained on the resulting 25K-instance dataset achieve gains of 64.4 percentage points on complex reasoning benchmarks, with improvements generalizing across model scales and families.

AI · Neutral · arXiv – CS AI · Jun 2 · 7/10

Before the Model Learns the Bug: Fuzzing RLVR Verifiers

Researchers present a fuzzing framework to test verifiers used in Reinforcement Learning with Verifiable Rewards (RLVR), a system that replaces human feedback with automated reward functions like code validators. The study identifies a critical vulnerability: when verifiers contain bugs, AI models can learn and exploit those bugs during optimization, creating a new failure mode in AI safety.
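
The failure mode described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the framework from the paper: `buggy_verifier`, `reference_verifier`, and `fuzz_verifier` are hypothetical names, and the bug (a substring match) is invented for the example. The fuzzer perturbs a correct answer and reports outputs that the buggy verifier accepts but a stricter reference check rejects.

```python
import random
import string

def buggy_verifier(output: str, expected: str) -> bool:
    """Hypothetical buggy check: passes whenever the answer appears anywhere."""
    return expected in output

def reference_verifier(output: str, expected: str) -> bool:
    """Hypothetical stricter check: the whole output must equal the answer."""
    return output.strip() == expected

def fuzz_verifier(expected: str, trials: int = 200, seed: int = 0) -> list[str]:
    """Return candidate outputs on which the two verifiers disagree."""
    rng = random.Random(seed)
    disagreements = []
    for _ in range(trials):
        # Prepend 0 to 3 random digits to the correct answer.
        prefix = "".join(rng.choice(string.digits) for _ in range(rng.randint(0, 3)))
        candidate = prefix + expected
        if buggy_verifier(candidate, expected) != reference_verifier(candidate, expected):
            disagreements.append(candidate)
    return disagreements

false_positives = fuzz_verifier("42")
```

Every entry in `false_positives` is a wrong answer that the buggy verifier would reward, which is the kind of gap a policy could learn to exploit during optimization.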

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

Researchers propose POPO (Group Prioritized Off-Policy Optimization), a new framework that improves reinforcement learning for large language model reasoning by efficiently reusing ineffective training samples without computational overhead. The method addresses a critical limitation in RLVR systems where many training samples yield zero-variance rewards, enabling faster model improvement across mathematics, planning, and visual reasoning tasks.
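
To see why zero-variance rewards yield no learning signal, here is a minimal sketch assuming GRPO-style group-normalized advantages (each reward minus the group mean, divided by the group standard deviation). It is not the POPO method itself; the function name `group_advantages` and the `eps` constant are assumptions made for illustration.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with mixed outcomes produces positive and negative advantages.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout gets the same reward produces all-zero
# advantages, so it contributes nothing to the policy gradient.
all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])
```

When every rollout for a prompt is correct (or every one is wrong), the whole group is effectively wasted under this normalization, which is the inefficiency the summary refers to.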

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

EchoRL: Reinforcement Learning via Rollout Echoing

EchoRL introduces a technique to overcome learning signal collapse in reinforcement learning systems that train large language models. By leveraging entropy patterns from expert trajectories to extract value from otherwise degenerate rollouts, the method achieves consistent performance improvements across multiple benchmarks and LLM architectures with minimal computational overhead.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Adaptive Negative Reinforcement for LLM Reasoning: Dynamically Balancing Correction and Diversity in RLVR

Researchers propose Adaptive Negative Sample Reinforcement (A-NSR) and Confidence-Weighted Negative Reinforcement (CW-NSR) to improve LLM reasoning by dynamically adjusting penalty weights during training rather than applying fixed penalties. The methods are evaluated on challenging math datasets using Qwen2.5-Math-1.5B, demonstrating that intelligent error correction can match or exceed complex frameworks like PPO.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Beyond Uniform Credit Assignment: Selective Eligibility Traces for RLVR

Researchers propose Selective Eligibility Traces (S-trace), a new method for reinforcement learning that improves credit assignment in large language models by selectively identifying critical reasoning steps rather than uniformly crediting entire trajectories. The approach demonstrates performance gains of 0.49-3.16% across Qwen models while improving sample and token efficiency compared to existing critic-free algorithms.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Emergent Slow Thinking in LLMs as Inverse Tree Freezing

Researchers present a statistical-physics framework explaining how large language models develop multi-step reasoning through reinforcement learning with verifiable rewards (RLVR), modeling the process as inverse tree freezing in a concept network. They propose Annealed-RLVR, a timing-optimized training method that outperforms standard RLVR by applying supervised fine-tuning at peak frustration rather than after convergence, preventing policy collapse.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

The Implicit Curriculum: Learning Dynamics in RL with Verifiable Rewards

Researchers develop a theoretical framework explaining how reinforcement learning with verifiable rewards (RLVR) enables long-horizon reasoning in large language models through an implicit curriculum effect. The analysis reveals that mixed-difficulty training naturally progresses from easy to hard problems without explicit scheduling, with learning dynamics determined by the smoothness of the difficulty spectrum.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Backdoors in RLVR: Jailbreak Backdoors in LLMs From Verifiable Reward

Researchers have discovered a critical vulnerability in Reinforcement Learning with Verifiable Rewards (RLVR), an emerging training paradigm that enhances LLM reasoning abilities. By injecting less than 2% poisoned data into training sets, attackers can implant backdoors that degrade safety performance by 73% when triggered, without modifying the reward verifier itself.

AI · Neutral · arXiv – CS AI · Mar 12 · 7/10

Does LLM Alignment Really Need Diversity? An Empirical Study of Adapting RLVR Methods for Moral Reasoning

A comprehensive study comparing reinforcement learning approaches for AI alignment finds that diversity-seeking algorithms don't outperform reward-maximizing methods in moral reasoning tasks. The research demonstrates that moral reasoning has more concentrated high-reward distributions than mathematical reasoning, making standard optimization methods equally effective without explicit diversity mechanisms.

AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

Generalization of RLVR Using Causal Reasoning as a Testbed

Researchers studied reinforcement learning with verifiable rewards (RLVR) for training large language models on causal reasoning tasks, finding it outperforms supervised fine-tuning but only when models have sufficient initial competence. The study used causal graphical models as a testbed and showed RLVR improves specific reasoning subskills like marginalization strategy and probability calculations.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

PAEC: Position-Aware Entropy Calibration for LLM Reasoning in RLVR

Researchers propose Position-Aware Entropy Calibration (PAEC), a novel technique that selectively manages entropy in reinforcement learning systems used to improve large language model reasoning. The method addresses policy-entropy collapse by applying targeted entropy penalties only at decision-critical token positions rather than uniformly across all tokens, demonstrating improved performance on mathematical reasoning benchmarks.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots

Researchers propose PivotTrace, a data-efficient framework for training large reasoning models that selects unlabeled samples for annotation without prior supervision. The method achieves 29.3% annotation efficiency while converging 2.75x faster than standard supervised approaches by leveraging attention dynamics to quantify uncertainty.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

CAST: Non-Privileged Clipped Asymmetric Self-Teaching with Advantage Flipping for GRPO

Researchers propose CAST, a new self-distillation method for reinforcement learning in large language models that improves upon existing approaches by using answer-free teacher scoring and bidirectional advantage flipping. The method addresses limitations in Group Relative Policy Optimization (GRPO) by providing denser token-level guidance while maintaining alignment with trajectory correctness, demonstrating improvements in mathematical reasoning tasks.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Annealed Softmax Greedy in Many-Armed Bayesian Bandits

This paper analyzes why reinforcement learning methods that update policies based on reward signals without explicitly tracking uncertainty can still be effective. Researchers prove that annealed softmax policies achieve near-optimal regret rates in many-armed Bayesian bandit settings when many near-optimal actions exist, providing theoretical justification for uncertainty-agnostic approaches used in modern language model training.
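
As a rough illustration of an uncertainty-agnostic policy of this kind, the sketch below runs a softmax policy over running reward estimates with a temperature that decays over time. It is a generic example, not the algorithm or analysis from the paper: the `1/sqrt(t)` schedule, the Bernoulli rewards, and the names `softmax` and `run_bandit` are all assumptions.

```python
import math
import random

def softmax(values, temperature):
    """Numerically stable softmax over values scaled by 1/temperature."""
    m = max(values)
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def run_bandit(arm_means, steps=2000, seed=0):
    """Play a Bernoulli bandit with an annealed softmax policy.

    The policy tracks only the empirical mean reward of each arm,
    with no confidence intervals or posterior.
    """
    rng = random.Random(seed)
    counts = [0] * len(arm_means)
    estimates = [0.0] * len(arm_means)
    for t in range(1, steps + 1):
        temperature = 1.0 / math.sqrt(t)  # assumed annealing schedule
        probs = softmax(estimates, temperature)
        arm = rng.choices(range(len(arm_means)), weights=probs)[0]
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return counts, estimates

counts, estimates = run_bandit([0.2, 0.5, 0.8, 0.79, 0.81])
```

As the temperature falls, the policy shifts from near-uniform sampling toward the arms with the highest estimated reward, without ever computing an explicit uncertainty term.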

AI · Neutral · arXiv – CS AI · May 28 · 5/10

Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR

Researchers introduce REFT, a method that improves Reinforcement Learning with Verifiable Rewards (RLVR) by diversifying the first token generated after reasoning markers, addressing a previously overlooked bottleneck in rollout diversity. The technique achieves measurable improvements across multiple model sizes and difficulty levels without requiring changes to existing RLVR pipelines.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

IRDS: Interpretable RLVR Data Selection via Verifier-Coupled Sparse Autoencoder Coverage

IRDS introduces a new data selection method for reinforcement learning with verifiable rewards (RLVR) that uses sparse autoencoders to identify interpretable, high-value training instances. The approach achieves significant accuracy improvements on math reasoning benchmarks while reducing computational costs by an order of magnitude compared to existing methods.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Quantifying Empirical Compute-Supervision Tradeoffs in RLVR

Researchers empirically tested whether increased compute can overcome imperfect verifier performance in reinforcement learning from verifiable rewards (RLVR), finding that verifier quality and training compute are not interchangeable. The study reveals that false negatives degrade model performance more severely than false positives, and compute scaling alone cannot close performance gaps caused by supervision noise.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

Researchers conducted a controlled study on reinforcement learning with verifiable rewards (RLVR) for reasoning models, revealing that training data allocation across multiple reasoning dimensions—depth, environment complexity, and reasoning types—significantly impacts model performance. The study found that joint coverage of these dimensions outperforms single-axis training approaches, and that models exhibit systematic weaknesses in abductive reasoning regardless of training setup.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

How You Begin is How You Reason: Driving Exploration in RLVR via Prefix-Tuned Priors

Researchers propose IMAX, a framework that uses trainable prefix tuning to improve exploration in reinforcement learning with verifiable rewards (RLVR) for language model reasoning. The approach addresses entropy collapse by creating diverse reasoning trajectories, achieving gains of up to 11.60% in Pass@4 accuracy across multiple model scales.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

AIPO: Learning to Reason from Active Interaction

Researchers introduce AIPO, a reinforcement learning framework that enhances large language model reasoning by enabling active consultation with collaborative agents during training. The method addresses exploration limitations in current RL approaches and demonstrates consistent performance improvements across multiple mathematical and coding benchmarks.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Structured Role-Aware Policy Optimization for Multimodal Reasoning

Researchers introduce Structured Role-Aware Policy Optimization (SRPO), a reinforcement learning method that improves multimodal AI reasoning by assigning credit to different token types based on their functional roles. The approach enhances vision-language models' ability to ground answers in visual evidence without requiring external reward models, advancing more reliable multimodal reasoning systems.
