#reward-signals News & Analysis

3 articles tagged with #reward-signals. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

VRPO: Rethinking Value Modeling for Robust RL under Noisy Supervision in LLM Post-Training

Researchers propose VRPO, a reinforcement learning framework that strengthens value modeling to handle noisy reward signals in large language model post-training. The approach uses auxiliary losses and information bottleneck techniques to enable value models to filter noise and generate more reliable advantage estimates, outperforming standard methods like PPO and GRPO across dialogue, math, and QA tasks.

🏢 Perplexity

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Label-Free Reinforcement Learning via Cross-Model Entropy

Researchers propose Cross-Model Entropy (CME), a label-free reward signal for reinforcement learning that uses a separate verifier model's likelihood assessment instead of human labels or self-referential signals. The method successfully extends RL post-training to open-ended instruction following across multiple model families, achieving win rates of 52.5-71.4% in head-to-head comparisons.

🧠 Llama

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Quantifying Empirical Compute-Supervision Tradeoffs in RLVR

Researchers empirically tested whether increased compute can overcome imperfect verifier performance in reinforcement learning from verifiable rewards (RLVR), finding that verifier quality and training compute are not interchangeable. The study reveals that false negatives degrade model performance more severely than false positives, and compute scaling alone cannot close performance gaps caused by supervision noise.
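
For readers unfamiliar with the two error types, the sketch below shows one way supervision noise in a binary verifier can be modeled. It is an illustration only, not the study's experimental setup: the function names `noisy_verifier` and `mean_reward` and the parameters `fn_rate` and `fp_rate` are hypothetical. A false negative rejects a correct answer; a false positive accepts an incorrect one.

```python
import random


def noisy_verifier(is_correct: bool, fn_rate: float, fp_rate: float,
                   rng: random.Random) -> int:
    """Return a binary reward from a hypothetical imperfect verifier.

    fn_rate: probability that a correct answer is rejected (false negative).
    fp_rate: probability that an incorrect answer is accepted (false positive).
    """
    if is_correct:
        # A correct answer earns reward 1 unless the verifier wrongly rejects it.
        return 0 if rng.random() < fn_rate else 1
    # An incorrect answer earns reward 0 unless the verifier wrongly accepts it.
    return 1 if rng.random() < fp_rate else 0


def mean_reward(truth, fn_rate: float, fp_rate: float, seed: int = 0) -> float:
    """Average noisy reward over a list of ground-truth correctness flags."""
    rng = random.Random(seed)
    rewards = [noisy_verifier(t, fn_rate, fp_rate, rng) for t in truth]
    return sum(rewards) / len(rewards)


if __name__ == "__main__":
    # Assumed toy data: 3 of 4 sampled answers are actually correct.
    truth = [True, True, True, False]
    print("clean verifier:", mean_reward(truth, fn_rate=0.0, fp_rate=0.0))
    print("noisy verifier:", mean_reward(truth, fn_rate=0.3, fp_rate=0.1))
```

With both rates at zero the reward matches ground truth exactly. Raising `fn_rate` withholds reward from correct answers, while raising `fp_rate` rewards incorrect ones, which is the distinction the summary draws between the two kinds of noise.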