10 articles tagged with #reward-modeling. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠 Researchers propose a label-free self-supervised reinforcement learning framework that enables language models to follow complex multi-constraint instructions without external supervision. The approach derives reward signals directly from instructions and uses constraint decomposition strategies to address sparse reward challenges, demonstrating strong performance across both in-domain and out-of-domain instruction-following tasks.
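The core recipe is easy to picture: split the instruction into atomic constraints, verify each one programmatically, and use the satisfied fraction as a dense reward. A minimal sketch, assuming toy verifiers and a semicolon-delimited constraint format (the paper's actual decomposition strategy is not specified in the summary):

```python
# Hypothetical sketch of a constraint-decomposition reward. Every function
# here is an illustrative assumption, not the paper's implementation.
import re

def decompose(instruction: str) -> list[str]:
    # Assume constraints arrive as semicolon-separated clauses, e.g.
    # "write 2 sentences; mention Paris; end with a question".
    return [c.strip() for c in instruction.split(";") if c.strip()]

def check_constraint(constraint: str, response: str) -> float:
    # Toy verifiers; a real system would use programmatic or model-based checks.
    if m := re.match(r"write (\d+) sentences", constraint):
        return float(len(re.findall(r"[.!?]", response)) == int(m.group(1)))
    if m := re.match(r"mention (\w+)", constraint):
        return float(m.group(1).lower() in response.lower())
    if constraint == "end with a question":
        return float(response.rstrip().endswith("?"))
    return 0.0

def reward(instruction: str, response: str) -> float:
    # Per-constraint credit turns one sparse pass/fail signal into a dense score.
    checks = [check_constraint(c, response) for c in decompose(instruction)]
    return sum(checks) / len(checks) if checks else 0.0

print(reward("write 2 sentences; mention Paris; end with a question",
             "Paris is lovely. Shall we go?"))  # 1.0
```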
AI · Bullish · arXiv – CS AI · 5d ago · 7/10
🧠 Researchers introduce a listener-augmented reinforcement learning framework for training vision-language models to better align with human visual preferences. By using an independent frozen model to evaluate and validate reasoning chains, the approach achieves 67.4% accuracy on the ImageReward benchmark and demonstrates significant improvements in out-of-distribution generalization.
🏢 Hugging Face
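The listener idea reduces to reward shaping: blend hard task correctness with a frozen second model's confidence that the reasoning alone justifies the answer. A minimal sketch, where the blend weight and the listener interface are assumptions rather than the paper's design:

```python
# Sketch of a listener-shaped reward. ALPHA and the listener callable are
# illustrative assumptions, not the paper's actual interface.
from typing import Callable

ALPHA = 0.7  # assumed weight between task correctness and listener agreement

def shaped_reward(correct: bool, reasoning: str, answer: str,
                  listener: Callable[[str, str], float]) -> float:
    # listener(reasoning, answer) -> P(answer is supported | reasoning alone),
    # scored by a frozen model, so the policy is rewarded for reasoning that
    # an independent reader can verify.
    agreement = listener(reasoning, answer)
    return ALPHA * float(correct) + (1.0 - ALPHA) * agreement

# Toy usage: a listener that just checks the answer is restated in the reasoning.
toy_listener = lambda r, a: 1.0 if a.lower() in r.lower() else 0.0
print(shaped_reward(True, "Image B is sharper, so B.", "B", toy_listener))  # 1.0
```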
AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠 Researchers introduce RM-R1, a new class of Reasoning Reward Models (ReasRMs) that integrate chain-of-thought reasoning into reward modeling for large language models. Using a chain-of-rubrics mechanism and a two-stage training process, the models outperform much larger competitors, including GPT-4o, by up to 4.9% across reward-model benchmarks.
🧠 GPT-4 🧠 Llama
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠 Researchers propose a new framework called Critic Rubrics to bridge the gap between academic coding-agent benchmarks and real-world applications. The system learns from sparse, noisy human-interaction data using 24 behavioral features and shows significant improvements on code-generation tasks, including a 15.9% gain in reranking performance on SWE-bench.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10
🧠 Researchers present a new mathematical framework for training AI reward models using Likert scale preferences instead of simple binary comparisons. The approach uses ordinal regression to better capture nuanced human feedback, outperforming existing methods across chat, reasoning, and safety benchmarks.
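Ordinal regression replaces the usual Bradley-Terry pairwise loss with a cumulative-link model over the K Likert levels: the reward model still emits one scalar, and learned ordered thresholds carve it into level probabilities. A sketch in PyTorch, with the parameterization assumed (the summary does not give the paper's exact formulation):

```python
# Hedged sketch of an ordinal-regression ("ordered logit") head for a reward model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrdinalHead(nn.Module):
    def __init__(self, hidden_dim: int, num_levels: int = 5):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)  # scalar reward s(x)
        # Unconstrained offsets; cumsum of their softplus keeps thresholds ordered.
        self.offsets = nn.Parameter(torch.zeros(num_levels - 1))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        s = self.score(h)                                      # (B, 1)
        cuts = torch.cumsum(F.softplus(self.offsets), dim=0)   # ordered thresholds
        # P(y <= k) = sigmoid(cut_k - s); level probs are successive differences.
        cdf = torch.sigmoid(cuts - s)                          # (B, K-1)
        ones = torch.ones_like(cdf[:, :1])
        zeros = torch.zeros_like(ones)
        return torch.cat([cdf, ones], 1) - torch.cat([zeros, cdf], 1)  # (B, K)

def ordinal_nll(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Negative log-likelihood of the annotated Likert level (long tensor, 0..K-1).
    return -torch.log(probs.gather(1, labels.unsqueeze(1)).clamp_min(1e-9)).mean()
```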
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠 Researchers introduce Robometer, a new framework for training robot reward models that combines progress tracking with trajectory comparisons to better learn from failed attempts. The system is trained on RBM-1M, a dataset of over one million robot trajectories including failures, and shows improved performance across diverse robotics applications.
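The training signal described in the summary combines two standard ingredients: a Bradley-Terry ranking loss over trajectory pairs plus a regression loss on per-frame progress, so failed trajectories still supply gradient. A hedged sketch, with the combination weight and tensor interface assumed rather than taken from the paper:

```python
# Illustrative two-term reward-model loss; lam and the heads are assumptions.
import torch
import torch.nn.functional as F

def reward_model_loss(scores_a, scores_b, progress_pred, progress_target,
                      lam: float = 0.5):
    """scores_a / scores_b: (B,) reward scores for preferred / dispreferred
    trajectories; progress_pred / progress_target: (B,) values in [0, 1]."""
    pref = -F.logsigmoid(scores_a - scores_b).mean()   # Bradley-Terry ranking
    prog = F.mse_loss(progress_pred, progress_target)  # progress supervision
    return pref + lam * prog
```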
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠 Researchers propose CPMI, an automated method for training process reward models that reduces annotation costs by 84% and computational overhead by 98% compared to traditional Monte Carlo approaches. The technique uses contrastive mutual information to assign reward scores to reasoning steps in AI chain-of-thought trajectories without expensive human annotation or repeated LLM rollouts.
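For context, the Monte Carlo baseline that CPMI undercuts labels each reasoning step by the empirical success rate of completions sampled from that step, which is exactly what makes it expensive: N extra generations per step. A sketch of that baseline, with `generate_continuation` and `is_correct` as assumed stand-ins for the LLM and the answer checker:

```python
# Monte Carlo process-reward labeling (the costly baseline, not CPMI itself).
def mc_step_labels(prefix_steps, problem, generate_continuation, is_correct,
                   n_rollouts: int = 8):
    """Return one soft label per reasoning step: the fraction of sampled
    completions from that step that reach a correct final answer."""
    labels = []
    for k in range(1, len(prefix_steps) + 1):
        prefix = "\n".join(prefix_steps[:k])
        wins = sum(
            is_correct(problem, generate_continuation(problem, prefix))
            for _ in range(n_rollouts)
        )
        labels.append(wins / n_rollouts)  # n_rollouts LLM calls per step
    return labels
```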
AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠 Researchers introduce Visual-ERM, a multimodal reward model that improves vision-to-code tasks by evaluating visual equivalence in rendered outputs rather than relying on text-based rules. The system achieves significant gains on chart-to-code tasks (+8.4 points) and shows consistent improvements across table and SVG parsing applications.
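The underlying move is to reward what the code draws, not what it says: render both programs and compare the resulting images. A deliberately crude pixel-level sketch (the paper trains a learned reward model; `render` here is an assumed hook such as a headless chart or SVG rasterizer):

```python
# Minimal visual-equivalence reward, assuming a render(code) -> PIL.Image hook.
import numpy as np
from PIL import Image

def image_similarity(a: Image.Image, b: Image.Image) -> float:
    # Normalize to a common size/mode, then score by mean absolute pixel error.
    b = b.resize(a.size).convert("RGB")
    x = np.asarray(a.convert("RGB"), dtype=np.float32) / 255.0
    y = np.asarray(b, dtype=np.float32) / 255.0
    return 1.0 - float(np.abs(x - y).mean())  # 1.0 = pixel-identical

def visual_reward(generated_code: str, reference_code: str, render) -> float:
    try:
        return image_similarity(render(generated_code), render(reference_code))
    except Exception:
        return 0.0  # code that fails to render earns no reward
```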
AI · Neutral · arXiv – CS AI · Mar 3 · 6/10
🧠 Researchers propose rubric-based reward modeling to address reward over-optimization in large language model fine-tuning. The approach focuses on the high-reward tail, where models struggle to distinguish excellent responses from merely great ones, using off-policy examples to improve training effectiveness.
AI · Bullish · arXiv – CS AI · Feb 27 · 6/10
🧠 Researchers propose ContextRL, a new framework that uses context augmentation to improve machine learning model efficiency in knowledge discovery. The framework enables smaller models like Qwen3-VL-8B to achieve performance comparable to much larger 32B models through enhanced reward modeling and multi-turn sampling strategies.