#reward-modeling News & Analysis

22 articles tagged with #reward-modeling. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Auto-Rubric as Reward: From Implicit Preferences to Explicit Multimodal Generative Criteria

Researchers introduce Auto-Rubric as Reward (ARR), a framework that replaces opaque scalar reward signals in multimodal AI alignment with explicit, structured criteria-based evaluation. By externalizing a model's implicit preferences into interpretable rubrics before comparison, ARR reduces evaluation bias and enables more reliable human-preference alignment in generative models.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

RewardHarness: Self-Evolving Agentic Post-Training

RewardHarness introduces a self-evolving agentic framework that dramatically improves reward modeling for image-editing evaluation using only 0.05% of typical training data. By iteratively refining tools and skills from minimal examples rather than large-scale annotations, the system achieves 47.4% accuracy on benchmarks, outperforming GPT-5 and enabling more efficient AI alignment.

🧠 GPT-5

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

Researchers introduce rubric-grounded reinforcement learning, a framework that trains AI models using structured, multi-criterion rewards from an LLM judge rather than binary outcomes. Training Llama-3.1-8B on scientific documents achieved 71.7% normalized reward and demonstrated improved performance on multiple reasoning benchmarks, suggesting that document-grounded training signals can produce generalizable reasoning capabilities.

🧠 Llama

AI · Bullish · arXiv – CS AI · May 9 · 7/10

CAMEL: Confidence-Gated Reflection for Reward Modeling

Researchers propose CAMEL, a reward modeling framework that combines efficient single-token preference decisions with selective reflection for low-confidence cases. It achieves 82.9% accuracy on benchmarks with only 14B parameters, outperforming larger 70B models.
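
The gating idea in this summary can be illustrated with a minimal Python sketch. It is not CAMEL's implementation: the function names, the 0.75 threshold, and the `reflect` callback are hypothetical, and confidence is assumed here to be the softmax probability of the preferred verdict token.

```python
import math
from typing import Callable, Tuple


def preference_from_logits(logit_a: float, logit_b: float) -> Tuple[str, float]:
    """Turn the logits of two verdict tokens ("A" / "B") into a choice and a confidence."""
    # Two-way softmax over the verdict tokens, shifted by the max for stability.
    m = max(logit_a, logit_b)
    exp_a, exp_b = math.exp(logit_a - m), math.exp(logit_b - m)
    p_a = exp_a / (exp_a + exp_b)
    return ("A", p_a) if p_a >= 0.5 else ("B", 1.0 - p_a)


def gated_judgment(
    logit_a: float,
    logit_b: float,
    reflect: Callable[[], str],
    threshold: float = 0.75,  # hypothetical value, not taken from the paper
) -> Tuple[str, bool]:
    """Return (choice, used_reflection).

    The cheap single-token decision is kept when it is confident enough;
    otherwise the (assumed more expensive) reflection step decides.
    """
    choice, confidence = preference_from_logits(logit_a, logit_b)
    if confidence >= threshold:
        return choice, False
    return reflect(), True
```

In this sketch a call such as `gated_judgment(4.0, 0.0, reflect)` keeps the fast verdict, while near-tied logits fall through to `reflect`, so the extra compute is spent only on uncertain comparisons.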

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

Researchers introduce AgentV-RL, an agentic verifier framework that enhances reward modeling for large language models by combining bidirectional reasoning agents with tool-use capabilities. The system addresses critical limitations in LLM verification by enabling forward and backward tracing of solutions, achieving 25.2% performance gains over existing methods and positioning agentic reward modeling as a promising new paradigm.

AI · Bullish · arXiv – CS AI · Apr 15 · 7/10

Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

Researchers propose a label-free self-supervised reinforcement learning framework that enables language models to follow complex multi-constraint instructions without external supervision. The approach derives reward signals directly from instructions and uses constraint decomposition strategies to address sparse reward challenges, demonstrating strong performance across both in-domain and out-of-domain instruction-following tasks.

AI · Bullish · arXiv – CS AI · Apr 13 · 7/10

Listener-Rewarded Thinking in VLMs for Image Preferences

Researchers introduce a listener-augmented reinforcement learning framework for training vision-language models to better align with human visual preferences. By using an independent frozen model to evaluate and validate reasoning chains, the approach achieves 67.4% accuracy on ImageReward benchmarks and demonstrates significant improvements in out-of-distribution generalization.

🏢 Hugging Face

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

RM-R1: Reward Modeling as Reasoning

Researchers introduce RM-R1, a new class of Reasoning Reward Models (ReasRMs) that integrate chain-of-thought reasoning into reward modeling for large language models. The models outperform much larger competitors including GPT-4o by up to 4.9% across reward model benchmarks by using a chain-of-rubrics mechanism and two-stage training process.

🧠 GPT-4 · 🧠 Llama

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

A Rubric-Supervised Critic from Sparse Real-World Outcomes

Researchers propose a framework called Critic Rubrics to bridge the gap between academic coding agent benchmarks and real-world applications. The system learns from sparse, noisy human interaction data using 24 behavioral features and shows significant improvements in code generation tasks, including 15.9% better reranking performance on SWE-bench.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 4

Beyond Binary Preferences: A Principled Framework for Reward Modeling with Ordinal Feedback

Researchers present a new mathematical framework for training AI reward models using Likert scale preferences instead of simple binary comparisons. The approach uses ordinal regression to better capture nuanced human feedback, outperforming existing methods across chat, reasoning, and safety benchmarks.
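
To make the ordinal-regression idea concrete, here is a minimal Python sketch of a standard cumulative-link (proportional-odds) likelihood over Likert labels. It is one common formulation and not necessarily the one in the paper; the function names, the cutpoint values in the usage note, and the convention that higher labels mean a stronger preference for response A are assumptions.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def ordinal_probs(reward_diff: float, cutpoints: Sequence[float]) -> List[float]:
    """Probability of each Likert label given reward_diff = r(A) - r(B).

    Cumulative-link model: P(label <= k) = sigmoid(cutpoints[k] - reward_diff).
    `cutpoints` must be strictly increasing; K-1 cutpoints give K labels.
    """
    cdf = [sigmoid(c - reward_diff) for c in cutpoints] + [1.0]
    probs, previous = [], 0.0
    for value in cdf:
        probs.append(value - previous)  # P(label == k) = CDF(k) - CDF(k-1)
        previous = value
    return probs


def ordinal_nll(reward_diff: float, label: int, cutpoints: Sequence[float]) -> float:
    """Negative log-likelihood of one annotated label (0 = strongly prefers B)."""
    return -math.log(ordinal_probs(reward_diff, cutpoints)[label])
```

With hypothetical cutpoints `[-2.0, -0.5, 0.5, 2.0]` (a five-point scale), a large positive reward difference puts most of the probability on the top label, so a "strongly prefers A" annotation is cheap under the loss only when the reward model agrees by a wide margin. With a single cutpoint at 0 the model reduces to the usual binary Bradley-Terry likelihood.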

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

Robometer: Scaling General-Purpose Robotic Reward Models via Trajectory Comparisons

Researchers introduce Robometer, a new framework for training robot reward models that combines progress tracking with trajectory comparisons to better learn from failed attempts. The system is trained on RBM-1M, a dataset of over one million robot trajectories including failures, and shows improved performance across diverse robotics applications.

AI · Neutral · arXiv – CS AI · 4d ago · 6/10

In-Context Reward Adaptation for Robust Preference Modeling

Researchers propose In-Context Reward Adaptation, a transformer-based framework that dynamically models diverse human preferences without costly retraining. By incorporating human response time as an auxiliary signal, the approach enables language models to adapt to unseen preference domains on-the-fly, addressing a critical limitation of static reward models used in RLHF systems.

AI · Bullish · arXiv – CS AI · 6d ago · 6/10

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

Researchers propose Tournament-GRPO, a reinforcement learning framework that uses group-wise tournament comparisons instead of absolute scoring to improve long-form text generation. By converting rubric-based LLM judgments into relative rewards through competitive rankings, the method achieves a 4.52-point improvement over existing approaches on Deep Research Bench.
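
A minimal Python sketch of turning absolute judge scores into relative, group-wise rewards follows. The round-robin format, the tie handling, and the function names are assumptions made for illustration, and the paper's tournament design may differ; the mean/standard-deviation normalization is the usual GRPO-style group advantage.

```python
from statistics import mean, pstdev
from typing import List


def tournament_rewards(judge_scores: List[float]) -> List[float]:
    """Round-robin within one sampled group.

    Each response's reward is the fraction of opponents it beats
    (a tie counts as half a win), so only relative order matters.
    """
    n = len(judge_scores)
    if n < 2:
        raise ValueError("a tournament needs at least two responses")
    rewards = []
    for i, score_i in enumerate(judge_scores):
        wins = 0.0
        for j, score_j in enumerate(judge_scores):
            if i == j:
                continue
            if score_i > score_j:
                wins += 1.0
            elif score_i == score_j:
                wins += 0.5
        rewards.append(wins / (n - 1))
    return rewards


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style advantage: reward standardized within its own group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For hypothetical rubric scores `[7.0, 9.0, 7.0, 3.0]`, `tournament_rewards` returns `[0.5, 1.0, 0.5, 0.0]`. Shifting or rescaling all judge scores leaves these rewards unchanged, which is the practical appeal of relative over absolute scoring.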

AI · Neutral · arXiv – CS AI · May 12 · 6/10

BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

BoostAPR is an AI framework that improves automated program repair by using dual reward models and reinforcement learning to identify which code edits actually fix bugs. The system achieves significant improvements on multiple benchmarks, including 40.7% on SWE-bench Verified, demonstrating that more granular feedback mechanisms can substantially enhance AI's ability to repair software bugs.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Mitigating Cognitive Bias in RLHF by Altering Rationality

Researchers propose a method to improve RLHF (Reinforcement Learning from Human Feedback) by treating the rationality parameter as context-dependent rather than fixed, using an LLM-as-judge to detect cognitive biases in human annotations and downweight unreliable comparisons. This approach enables training more robust AI models even when human feedback contains systematic biases.
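
The role of a rationality parameter can be seen in the standard Boltzmann-rational (Bradley-Terry) preference likelihood, where the probability that the chosen response wins is sigmoid(beta * (r_chosen - r_rejected)). The Python sketch below assumes that form. The `contextual_beta` mapping from a judge-assigned bias score to beta is purely hypothetical and is not the paper's method.

```python
import math


def preference_nll(reward_chosen: float, reward_rejected: float, beta: float) -> float:
    """-log sigmoid(beta * (r_chosen - r_rejected)), computed stably as softplus(-margin)."""
    margin = beta * (reward_chosen - reward_rejected)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def contextual_beta(bias_score: float, beta_max: float = 1.0) -> float:
    """Hypothetical mapping: a bias score in [0, 1] shrinks the rationality.

    bias_score = 0 means the annotation looks reliable (full beta_max);
    bias_score = 1 means it is treated as uninformative (beta = 0).
    """
    if not 0.0 <= bias_score <= 1.0:
        raise ValueError("bias_score must lie in [0, 1]")
    return beta_max * (1.0 - bias_score)
```

Because the loss gradient with respect to the reward difference scales with beta, a comparison assigned a small beta moves the reward model less. At beta = 0 the loss is the constant log 2 and the comparison has no influence at all.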

AI · Bullish · arXiv – CS AI · May 7 · 6/10

Efficiently Aligning Language Models with Online Natural Language Feedback

Researchers have developed methods to efficiently align language models using online natural language feedback in domains where human supervision is limited and difficult to quantify. By iteratively optimizing proxy reward models and collecting fresh expert feedback, the approach recovers 80-100% of full-supervision performance with 3-20x fewer expert samples, demonstrating significant improvements in training data efficiency.

🧠 Haiku

AI · Neutral · arXiv – CS AI · May 4 · 6/10

PORTool: Importance-Aware Policy Optimization with Rewarded Tree for Multi-Tool-Integrated Reasoning

PORTool is a new policy-optimization algorithm that improves how AI agents learn to use external tools by solving the credit-assignment problem in multi-step reasoning tasks. The method uses a rewarded tree structure to assign rewards at individual steps rather than only at outcomes, enabling agents to achieve higher accuracy while reducing unnecessary tool calls.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Reward Weighted Classifier-Free Guidance as Policy Improvement in Autoregressive Models

Researchers demonstrate that reward-weighted classifier-free guidance (RCFG) can dynamically adjust autoregressive model outputs to optimize arbitrary reward functions at test time without retraining. Applied to molecular generation, this approach enables real-time optimization of competing objectives and accelerates reinforcement learning convergence when used as a teacher for policy distillation.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

Efficient Process Reward Modeling via Contrastive Mutual Information

Researchers propose CPMI, an automated method for training process reward models that reduces annotation costs by 84% and computational overhead by 98% compared to traditional Monte Carlo approaches. The technique uses contrastive mutual information to assign reward scores to reasoning steps in AI chain-of-thought trajectories without expensive human annotation or repeated LLM rollouts.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

Visual-ERM: Reward Modeling for Visual Equivalence

Researchers introduce Visual-ERM, a multimodal reward model that improves vision-to-code tasks by evaluating visual equivalence in rendered outputs rather than relying on text-based rules. The system achieves significant performance gains on chart-to-code tasks (+8.4) and shows consistent improvements across table and SVG parsing applications.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 3

Chasing the Tail: Effective Rubric-based Reward Modeling for Large Language Model Post-Training

Researchers propose rubric-based reward modeling to address reward over-optimization in large language model fine-tuning. The approach focuses on the high-reward tail where models struggle to distinguish excellent responses from merely great ones, using off-policy examples to improve training effectiveness.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 7

ContextRL: Enhancing MLLM's Knowledge Discovery Efficiency with Context-Augmented RL

Researchers propose ContextRL, a framework that uses context augmentation to improve the knowledge discovery efficiency of multimodal large language models (MLLMs). The framework enables smaller models like Qwen3-VL-8B to achieve performance comparable to much larger 32B models through enhanced reward modeling and multi-turn sampling strategies.