AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers propose IMAX, a framework that uses trainable prefix tuning to improve exploration in reinforcement learning with verifiable rewards (RLVR) for language model reasoning. The approach addresses entropy collapse by creating diverse reasoning trajectories, with Pass@4 accuracy gains of up to 11.60% across multiple model scales.
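For context on the Pass@4 metric mentioned above: Pass@k is commonly reported with the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), where n responses are sampled per problem and c of them are correct. The sketch below illustrates that estimator only. Whether the IMAX evaluation uses exactly this form is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: total responses sampled for one problem
    c: how many of those n responses were verified correct
    k: the budget in Pass@k (k = 4 for Pass@4)
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # 1 minus the fraction of size-k subsets made up only of incorrect samples.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 8 samples, 2 correct, budget of 4.
    print(pass_at_k(n=8, c=2, k=4))
```

Averaging this value over all problems in a benchmark gives the benchmark-level Pass@k score.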
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce AIPO, a reinforcement learning framework that enhances large language model reasoning by enabling active consultation with collaborative agents during training. The method addresses exploration limitations in current RL approaches and demonstrates consistent performance improvements across multiple mathematical and coding benchmarks.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce Structured Role-Aware Policy Optimization (SRPO), a reinforcement learning method that improves multimodal AI reasoning by assigning credit to different token types based on their functional roles. The approach enhances vision-language models' ability to ground answers in visual evidence without requiring external reward models, advancing more reliable multimodal reasoning systems.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers propose a new approach to entropy control in Reinforcement Learning with Verifiable Rewards (RLVR) for Large Language Models, addressing the problem of policy entropy collapse through dynamic gradient-preserving clipping mechanisms. The method uses importance sampling analysis and dynamic thresholds to maintain output diversity and prevent vanishing gradients during training, demonstrating improved performance across benchmarks.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose Listwise Policy Optimization (LPO), a new framework for training large language models that improves upon existing reinforcement learning approaches by explicitly projecting policies toward target distributions on the response simplex. The method demonstrates consistent performance improvements across reasoning tasks while maintaining training stability and response diversity.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠A new research paper identifies implicit reward overfitting in Reinforcement Learning with Verifiable Rewards (RLVR), revealing that model improvements concentrate in rank-1 components while potentially sacrificing broader knowledge retention. The findings suggest RLVR optimizes singular spectrum distributions rather than general reasoning, with implications for improving AI training paradigms and continual learning approaches.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers provide theoretical foundations for Reinforcement Learning with Verifiable Rewards (RLVR), a technique for post-training large language models using binary feedback. The analysis introduces the 'Gradient Gap' concept to explain convergence dynamics and derives critical step-size thresholds that determine whether training succeeds or fails, with implications for practical implementations like length normalization.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce PerMix-RLVR, a training method that enables large language models to maintain persona flexibility while preserving task robustness. The approach addresses a fundamental trade-off in reinforcement learning with verifiable rewards, where models become less responsive to persona prompts but gain improved performance on objective tasks.
AI · Bullish · arXiv – CS AI · Apr 10 · 6/10
🧠Researchers introduce RePro, a novel post-training technique that optimizes large language models' reasoning processes by framing chain-of-thought as gradient descent and using process-level rewards to reduce overthinking. The method demonstrates consistent performance improvements across mathematics, science, and coding benchmarks while mitigating inefficient reasoning behaviors in LLMs.
AI · Bullish · arXiv – CS AI · Mar 12 · 6/10
🧠Researchers introduce CLIPO (Contrastive Learning in Policy Optimization), a new method that improves upon Reinforcement Learning with Verifiable Rewards (RLVR) for training Large Language Models. CLIPO addresses hallucination and answer-copying issues by incorporating contrastive learning to better capture correct reasoning patterns across multiple solution paths.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 3
🧠Researchers propose Quantile Advantage Estimation (QAE) to stabilize Reinforcement Learning with Verifiable Rewards (RLVR) for large language model reasoning. The method replaces mean baselines with group-wise K-quantile baselines to prevent entropy collapse and explosion, showing sustained improvements on mathematical reasoning tasks.
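The baseline swap described above can be sketched in a few lines. This is a minimal illustration, assuming binary verifier rewards grouped per prompt and a "lower" quantile convention (the baseline is always one of the observed rewards). The function name `quantile_advantages`, the default `k=0.5`, and the quantile convention are assumptions for illustration, not details confirmed by the summary.

```python
from typing import List, Sequence


def quantile_advantages(
    groups: Sequence[Sequence[float]], k: float = 0.5
) -> List[List[float]]:
    """Compute advantages as reward minus a per-group K-quantile baseline.

    groups: one inner sequence of rewards per prompt (one reward per response)
    k: quantile level in [0, 1]; hypothetical default, not taken from the paper
    """
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    advantages = []
    for rewards in groups:
        ordered = sorted(rewards)
        # "Lower" quantile: pick the order statistic at floor(k * (n - 1)).
        baseline = ordered[int(k * (len(ordered) - 1))]
        advantages.append([r - baseline for r in rewards])
    return advantages


if __name__ == "__main__":
    # First prompt is hard (1 of 4 correct), second is easy (3 of 4 correct).
    print(quantile_advantages([[1, 0, 0, 0], [1, 1, 1, 0]], k=0.5))
```

Under these assumptions, a mostly-failing group gets a baseline of 0, so only its correct responses receive a nonzero (positive) advantage, while a mostly-succeeding group gets a baseline of 1, so only its incorrect responses receive a nonzero (negative) advantage. A mean baseline, by contrast, would assign nonzero advantages to every response in both groups.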
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 14
🧠Researchers propose SCOPE, a new framework for Reinforcement Learning from Verifiable Rewards (RLVR) that improves AI reasoning by salvaging partially correct solutions rather than discarding them entirely. The method achieves 46.6% accuracy on math reasoning tasks and 53.4% on out-of-distribution problems by using step-wise correction to maintain exploration diversity.