#reward-learning News & Analysis

10 articles tagged with #reward-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

TimeRewarder: Learning Dense Reward from Passive Videos via Frame-wise Temporal Distance

TimeRewarder is a new machine learning method that learns dense reward signals from passive videos to improve reinforcement learning in robotics. By modeling temporal distances between video frames, the approach achieves 90% success rates on Meta-World tasks using significantly fewer environment interactions than prior methods, while also leveraging human videos for scalable reward learning.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

MAVRL: Learning Reward Functions from Multiple Feedback Types with Amortized Variational Inference

Researchers introduce MAVRL, an approach that learns a reward function jointly from several heterogeneous feedback types (demonstrations, comparisons, ratings, stops) by framing the problem as Bayesian inference and solving it with amortized variational inference. The method eliminates manual loss balancing and outperforms single-feedback approaches across discrete and continuous control benchmarks.

AI · Neutral · arXiv – CS AI · Jun 23 · 5/10

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

Researchers introduce UBP2, a model-based reinforcement learning method that improves sample efficiency in preference-based learning by actively directing exploration through uncertainty quantification across reward, dynamics, and value functions. The approach achieves sublinear regret guarantees and demonstrates substantially higher sample efficiency than existing methods on benchmark tasks.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models

Researchers introduce Stage-Aware Dynamic Weighting (SAW), a novel mechanism for multi-objective reinforcement learning in large language models that addresses the asynchronous nature of reward learning across different objectives. By using coefficient of variation as a real-time informativeness proxy, SAW dynamically reweights objective contributions to improve training efficiency and final performance with minimal computational overhead.
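
As a rough illustration of the idea in this summary, the sketch below reweights objectives by the coefficient of variation (standard deviation divided by mean) of their rewards within a batch. The summary does not give SAW's actual weighting rule, so the normalization to weights that sum to 1, the uniform fallback, and the names `cv_weights` and `combined_reward` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(rewards, eps=1e-8):
    """Population std divided by |mean|; eps guards against a zero mean."""
    return pstdev(rewards) / (abs(mean(rewards)) + eps)


def cv_weights(batch_rewards, eps=1e-8):
    """Map {objective: [rewards in batch]} to weights proportional to each CV.

    Assumption: weights are normalized to sum to 1. This is an illustrative
    choice, not a rule taken from the SAW paper.
    """
    cvs = {name: coefficient_of_variation(r, eps) for name, r in batch_rewards.items()}
    total = sum(cvs.values())
    if total <= eps:
        # Every objective is flat in this batch: fall back to uniform weights.
        return {name: 1.0 / len(cvs) for name in cvs}
    return {name: cv / total for name, cv in cvs.items()}


def combined_reward(per_objective, weights):
    """Weighted sum of one sample's per-objective rewards."""
    return sum(weights[name] * per_objective[name] for name in weights)
```

Under these assumptions, an objective whose reward is identical for every sample in the batch has a coefficient of variation of zero and therefore receives zero weight, while objectives whose rewards still vary across samples dominate the combined signal.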

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Reward Learning through Ranking Mean Squared Error

Researchers introduce R4 (Ranked Return Regression for RL), a new reinforcement learning method that learns reward functions from human ratings rather than binary preferences. The approach uses a novel ranking mean squared error loss and provides formal mathematical guarantees about solution completeness and minimality, demonstrating competitive or superior performance against existing methods on robotic benchmarks.

🏢 OpenAI · 🏢 Google

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

T-POP: Test-Time Personalization with Online Preference Feedback

Researchers introduce T-POP, a novel algorithm that personalizes large language models in real-time by learning from user preference feedback during text generation, without requiring parameter updates or extensive pre-existing user data. The method combines test-time alignment with dueling bandits to efficiently balance exploration and exploitation, addressing the cold-start problem in LLM personalization.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Reward Learning from Best-of-$N$ Preference Data: Targets, Tradeoffs, and Design Principles

Researchers analyze how Best-of-N sampling constructs preference data for reward learning in AI systems, deriving closed-form targets and identifying a fundamental tradeoff between margin and connectivity that is governed by the size of N. The work provides design principles for practitioners: use larger N when preference labels are scarce, smaller N when generation capacity is limited, and optimize base distributions to prioritize comparisons most relevant at deployment.
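
To make the margin side of this tradeoff concrete, here is a minimal simulation sketch. It assumes that a preference pair consists of the top-scoring candidate out of N plus one other candidate drawn uniformly at random, and that candidate quality is standard normal; the summary does not specify the paper's pairing scheme, and the names `best_of_n_pair` and `mean_margin` are hypothetical.

```python
import random


def best_of_n_pair(candidates, score, rng):
    """Return (chosen, rejected): the top-scoring candidate and one random other.

    Assumption: the rejected item is sampled uniformly from the non-winners.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")
    best_idx = max(range(len(candidates)), key=lambda i: score(candidates[i]))
    others = [c for i, c in enumerate(candidates) if i != best_idx]
    return candidates[best_idx], rng.choice(others)


def mean_margin(n, trials, rng):
    """Average score gap between chosen and rejected when quality is N(0, 1)."""
    total = 0.0
    for _ in range(trials):
        cands = [rng.gauss(0.0, 1.0) for _ in range(n)]
        chosen, rejected = best_of_n_pair(cands, score=lambda x: x, rng=rng)
        total += chosen - rejected
    return total / trials


if __name__ == "__main__":
    rng = random.Random(0)
    for n in (2, 4, 8, 16):
        print(n, round(mean_margin(n, 5000, rng), 3))
```

In this toy setting the average margin grows with N, because the winner is drawn from further out in the tail of the score distribution. The sketch does not model connectivity, the other side of the tradeoff described above.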

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

Researchers present a novel inverse reinforcement learning framework that handles multiple imperfect demonstrators with varying suboptimality levels, using a feasible-reward-set approach with linear constraints. The method includes theoretical guarantees for reward recovery and practical algorithms tested on grid-worlds and LLM fine-tuning, addressing a significant gap in real-world IRL applications.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Active teacher selection for reward learning

Researchers introduce the Hidden Utility Bandit (HUB) framework to address a critical limitation in reward learning systems: their reliance on feedback from a single idealized teacher. The framework models teacher heterogeneity in rationality, expertise, and cost, enabling Active Teacher Selection (ATS) algorithms that strategically choose which teachers to query, demonstrating superior performance in paper recommendation and vaccine testing applications.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

Learning Preference-Based Objectives from Clinical Narratives for Sequential Treatment Decision-Making

Researchers propose Clinical Narrative-informed Preference Rewards (CN-PR), a machine learning framework that extracts reward signals from patient discharge summaries to train reinforcement learning models for treatment decisions. The approach achieves strong alignment with clinical outcomes, including improved organ support-free days and faster shock resolution, offering a scalable alternative to traditional reward design in healthcare AI.