y0news

#reward-models News & Analysis

15 articles tagged with #reward-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 3
🧠

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Researchers introduce Skywork-Reward-V2, a suite of AI reward models trained on SynPref-40M, a dataset of 40 million preference pairs curated through human-AI collaboration. The models achieve state-of-the-art performance across seven major benchmarks by combining the quality of human annotation with the scalability of AI-assisted preference labeling.
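
Reward models trained on preference pairs like these typically minimize a Bradley-Terry-style pairwise loss. The summary does not give Skywork's exact training recipe, so the sketch below is a generic formulation with placeholder tensors standing in for a reward head's outputs.

```python
# Hedged sketch: generic Bradley-Terry loss for reward-model training on
# preference pairs (chosen vs. rejected). Not Skywork's actual recipe.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with random scores standing in for a reward head's outputs.
r_chosen = torch.randn(8, requires_grad=True)
r_rejected = torch.randn(8, requires_grad=True)
loss = preference_loss(r_chosen, r_rejected)
loss.backward()  # gradients would flow into the reward model in practice
```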

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10 · 3
🧠

Reward Models Inherit Value Biases from Pretraining

A comprehensive study of 10 leading reward models reveals they inherit significant value biases from their base language models, with Llama-based models preferring 'agency' values while Gemma-based models favor 'communion' values. This bias persists even when using identical preference data and training processes, suggesting that the choice of base model fundamentally shapes AI alignment outcomes.
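
One simple way to surface this kind of bias is to score value-themed responses and compare the averages. The probe below is an illustrative sketch, not the paper's protocol; the prompt, responses, and scoring function are made-up stand-ins for a real reward-model forward pass.

```python
# Hedged sketch: probing a reward model for agency vs. communion bias by
# averaging its scores over value-themed responses. The scorer is a stand-in.
from statistics import mean

def score(prompt: str, response: str) -> float:
    # Placeholder: substitute a real reward-model forward pass here.
    return float(len(response)) / 100.0

probe_prompt = "What matters most when choosing a career?"
agency_responses = ["Pursue ambition and personal achievement.",
                    "Take control and compete to win."]
communion_responses = ["Support your community and care for others.",
                       "Build cooperative, trusting relationships."]

agency_mean = mean(score(probe_prompt, r) for r in agency_responses)
communion_mean = mean(score(probe_prompt, r) for r in communion_responses)
print(f"agency={agency_mean:.3f}  communion={communion_mean:.3f}")
```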

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 6
🧠

Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation

Researchers introduce Dual-Iterative Preference Optimization (Dual-IPO), a new method that iteratively improves both reward models and video generation models to create higher-quality AI-generated videos better aligned with human preferences. The approach enables smaller 2B parameter models to outperform larger 5B models without requiring manual preference annotations.
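
The core pattern is an alternation: the generator and the reward model take turns improving, with preference pairs produced automatically by ranking the generator's own samples. The toy loop below shows only that alternation in a simplified numeric setting; it is not the paper's method, and the generator, reward model, and "target" are made-up stand-ins.

```python
# Hedged sketch of the dual-iteration pattern only (not Dual-IPO itself):
# generator and reward model improve in turns, with preference pairs created
# automatically by ranking the generator's own samples.
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -0.5])        # toy "ideal video" in feature space
gen_mean = np.zeros(2)                # toy generator: Gaussian around gen_mean
rm_weights = rng.normal(size=2)       # toy linear reward model

def reward(x):                        # reward model scores a sample
    return float(rm_weights @ x)

for iteration in range(5):
    # 1) Generator samples candidates; the current reward model ranks them.
    samples = gen_mean + rng.normal(scale=0.5, size=(16, 2))
    ranked = sorted(samples, key=reward, reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    # 2) Generator update: move toward the preferred sample (DPO-like nudge).
    gen_mean += 0.5 * (chosen - gen_mean)
    # 3) Reward-model update: refit on fresh samples from the improved
    #    generator (here, "quality" is just closeness to `target`).
    fresh = gen_mean + rng.normal(scale=0.5, size=(16, 2))
    labels = -np.linalg.norm(fresh - target, axis=1)
    rm_weights += 0.1 * fresh.T @ (labels - fresh @ rm_weights) / len(fresh)
    print(iteration, np.round(gen_mean, 2))
```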

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 7
🧠

Learning to Answer from Correct Demonstrations

Researchers propose a new approach for training AI models to generate correct answers from demonstrations, using imitation learning in contextual bandits rather than traditional supervised fine-tuning. The method achieves better sample complexity and works with weaker assumptions about the underlying reward model compared to existing likelihood-maximization approaches.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠

EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation

Researchers have developed EvolvR, a self-evolving framework that improves AI's ability to evaluate and generate stories through pairwise reasoning and multi-agent data filtering. The system achieves state-of-the-art performance on three evaluation benchmarks and significantly enhances story generation quality when used as a reward model.
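
Pairwise evaluation can be turned into a reward signal by tallying wins across comparisons. The snippet below sketches that general pattern only; the length-based judge is a placeholder, whereas EvolvR's actual evaluator is an LLM-based pairwise reasoner.

```python
# Hedged sketch: turning pairwise judgments into a ranking signal, the general
# pattern behind pairwise story evaluation. The judge below is a placeholder.
from itertools import combinations

def judge(story_a: str, story_b: str) -> str:
    # Placeholder pairwise judge: prefers the longer story.
    return story_a if len(story_a) >= len(story_b) else story_b

stories = ["A short tale.",
           "A longer tale with a clear arc and a twist at the end.",
           "A medium-length tale with some detail."]

wins = {s: 0 for s in stories}
for a, b in combinations(stories, 2):
    wins[judge(a, b)] += 1

# Win counts act as a reward signal, e.g. for best-of-N story generation.
best = max(stories, key=wins.get)
print(wins[best], best)
```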

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10
🧠

AdaBoN: Adaptive Best-of-N Alignment

Researchers propose AdaBoN, an adaptive Best-of-N alignment method that improves computational efficiency in language model alignment by allocating inference-time compute based on prompt difficulty. The two-stage algorithm outperforms uniform allocation strategies while using 20% less computational budget.
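
A two-stage, difficulty-aware allocation can be sketched as: probe each prompt with a few samples, estimate difficulty from the spread of probe rewards, then split the remaining budget in proportion to that estimate. This is an illustrative sketch under those assumptions, not AdaBoN's exact algorithm, and the simulated rewards are placeholders.

```python
# Hedged sketch of a two-stage adaptive Best-of-N pattern (not AdaBoN's exact
# algorithm): probe each prompt, estimate difficulty from reward spread, then
# allocate the remaining sampling budget accordingly.
import numpy as np

rng = np.random.default_rng(1)
num_prompts, probe_n, total_budget = 4, 4, 64

# Stage 1: probe each prompt with a few samples (rewards are simulated here).
probe_rewards = rng.normal(loc=0.0,
                           scale=rng.uniform(0.1, 2.0, num_prompts)[:, None],
                           size=(num_prompts, probe_n))
difficulty = probe_rewards.std(axis=1)   # high spread ~ worth more samples

# Stage 2: split the leftover budget in proportion to estimated difficulty.
remaining = total_budget - num_prompts * probe_n
extra = np.floor(remaining * difficulty / difficulty.sum()).astype(int)
allocation = probe_n + extra
print(dict(enumerate(allocation.tolist())))
```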

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10 · 12
🧠

RubricBench: Aligning Model-Generated Rubrics with Human Standards

RubricBench is a new benchmark with 1,147 pairwise comparisons designed to evaluate rubric-based assessment methods for Large Language Models. Research reveals a significant gap between human-annotated and AI-generated rubrics, showing that current state-of-the-art models struggle to autonomously create valid evaluation criteria.
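
The headline metric for a benchmark of this shape is simply agreement between model choices and human choices over labeled pairwise comparisons. The field names below are illustrative, not RubricBench's actual schema.

```python
# Hedged sketch: agreement rate over pairwise comparisons, the basic metric
# behind pairwise rubric benchmarks. Field names are illustrative only.
comparisons = [
    {"human_prefers": "rubric_a", "model_prefers": "rubric_a"},
    {"human_prefers": "rubric_b", "model_prefers": "rubric_a"},
    {"human_prefers": "rubric_a", "model_prefers": "rubric_a"},
]

agreement = sum(c["human_prefers"] == c["model_prefers"]
                for c in comparisons) / len(comparisons)
print(f"pairwise agreement: {agreement:.1%}")
```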

AI · Neutral · arXiv – CS AI · Mar 2 · 6/10 · 10
🧠

RewardUQ: A Unified Framework for Uncertainty-Aware Reward Models

Researchers introduce RewardUQ, a unified framework for evaluating uncertainty quantification in reward models used to align large language models with human preferences. The study finds that model size and initialization have the most significant impact on performance, while providing an open-source Python package to advance the field.
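
One common way to quantify reward-model uncertainty is an ensemble: score each response with several independently initialized reward heads and read the disagreement as uncertainty. The sketch below shows that generic pattern with simulated scores; it is not necessarily RewardUQ's API or method.

```python
# Hedged sketch: ensemble-based uncertainty for reward models. Scores are
# simulated; a real setup would run several reward heads on each response.
import numpy as np

rng = np.random.default_rng(2)
ensemble_size, num_responses = 5, 3

# Rows are ensemble members, columns are candidate responses.
scores = rng.normal(size=(ensemble_size, num_responses))

mean_reward = scores.mean(axis=0)      # point estimate per response
uncertainty = scores.std(axis=0)       # disagreement = epistemic uncertainty
for i, (m, u) in enumerate(zip(mean_reward, uncertainty)):
    print(f"response {i}: reward={m:+.2f} ± {u:.2f}")
```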

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 15
🧠

Real-Time Aligned Reward Model beyond Semantics

Researchers introduce R2M (Real-Time Aligned Reward Model), a new framework for Reinforcement Learning from Human Feedback (RLHF) that addresses reward overoptimization in large language models. The system uses real-time policy feedback to better align reward models with evolving policy distributions during training.

AI · Neutral · arXiv – CS AI · Mar 2 · 7/10 · 15
🧠

What Makes a Reward Model a Good Teacher? An Optimization Perspective

Research reveals that reward model accuracy alone doesn't determine effectiveness in RLHF systems. The study proves that low reward variance can create flat optimization landscapes, making even perfectly accurate reward models inefficient teachers that underperform less accurate models with higher variance.
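
A small numeric illustration of that claim: two reward models can rank responses identically (same accuracy) while one has much lower variance, which shrinks the centered advantage signal a policy-gradient step has to work with. The numbers below are invented for illustration.

```python
# Hedged illustration: identical rankings, very different reward variance.
# Low variance flattens the advantage signal available to policy gradients.
import numpy as np

rm_high_var = np.array([0.1, 1.0, 2.5, 4.0])      # wide spread, correct order
rm_low_var = np.array([1.99, 2.00, 2.01, 2.02])   # nearly flat, same order

for name, r in [("high-variance RM", rm_high_var),
                ("low-variance RM", rm_low_var)]:
    advantages = r - r.mean()                      # centered reward ~ advantage
    print(f"{name}: reward variance={r.var():.4f}, "
          f"max |advantage|={np.abs(advantages).max():.4f}")
```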

AI · Neutral · arXiv – CS AI · Mar 9 · 5/10
🧠

Revisiting the (Sub)Optimality of Best-of-N for Inference-Time Alignment

Researchers revisited Best-of-N (BoN) sampling for AI alignment and found it's actually optimal when evaluated using win-rate metrics rather than expected true reward. They propose a variant that eliminates reward-hacking vulnerabilities while maintaining optimal performance.
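
For reference, plain Best-of-N draws N candidates and keeps the one a (proxy) reward model scores highest, and win rate compares that pick against single-sample decoding under the true reward. The sketch below shows only this baseline with toy scalar "responses"; the win-rate-optimal variant proposed in the paper is not reproduced.

```python
# Hedged sketch of plain Best-of-N sampling and a single win-rate comparison,
# using toy scalar "responses" and a noisy proxy reward. Not the paper's variant.
import numpy as np

rng = np.random.default_rng(3)

def best_of_n(candidates: np.ndarray, proxy_reward) -> float:
    return candidates[np.argmax(proxy_reward(candidates))]

proxy = lambda x: x + rng.normal(scale=0.5, size=x.shape)   # noisy reward proxy
true_reward = lambda x: x                                    # ground truth

samples = rng.normal(size=16)
pick = best_of_n(samples, proxy)
baseline = rng.choice(samples)
# Win rate of BoN over single-sample decoding, under the true reward.
print("BoN wins:", true_reward(pick) > true_reward(baseline))
```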

AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 6
🧠

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

Researchers introduce CMI-RewardBench, a comprehensive evaluation framework for music generation AI models that can process multimodal inputs including text, lyrics, and audio. The system includes a 110k-sample preference dataset and reward models that show strong correlation with human judgments for music quality assessment.
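
Checking a reward model against human judgments of this kind usually comes down to a rank correlation between model scores and human ratings. The scores below are made-up placeholders; the paper's actual evaluation protocol is not reproduced.

```python
# Hedged sketch: rank correlation between reward-model scores and human
# ratings as a sanity check. All numbers are invented placeholders.
from scipy.stats import spearmanr

human_ratings = [4.5, 3.0, 2.0, 4.0, 1.5]      # e.g., mean listener scores
model_scores = [0.92, 0.55, 0.40, 0.80, 0.20]  # reward-model outputs

rho, p_value = spearmanr(human_ratings, model_scores)
print(f"Spearman rho={rho:.2f} (p={p_value:.3f})")
```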

AI · Neutral · OpenAI News · Oct 19 · 1/10 · 7
🧠

Scaling laws for reward model overoptimization

The article appears to discuss scaling laws related to reward model overoptimization in AI systems. However, the article body is empty, making it impossible to provide meaningful analysis of the content or implications.