#reward-shaping News & Analysis

10 articles tagged with #reward-shaping. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Reward Shaping for (Inference-Time) Alignment: A Stackelberg Game Perspective

Researchers propose a Stackelberg game framework for optimizing reward models in large language model alignment, addressing the suboptimality of standard KL-regularized reward optimization. A simple reward shaping scheme improves inference-time alignment by reducing base policy bias while mitigating reward hacking risks, demonstrating 66%+ win rates against baselines.
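
For context, the standard KL-regularized objective mentioned above (maximize expected reward minus beta times the KL divergence from the base policy) has a well-known closed-form solution: the base policy reweighted by exp(reward / beta). The sketch below illustrates only that baseline, not the paper's Stackelberg formulation or its shaping scheme. The function name, the three candidate responses, and all numbers are hypothetical.

```python
import math

def kl_regularized_policy(base_probs, rewards, beta):
    """Closed-form optimum of E[r] - beta * KL(pi || base):
    pi*(y) is proportional to base_probs[y] * exp(rewards[y] / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(base_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy setup: three candidate responses.
base_probs = [0.90, 0.09, 0.01]  # base policy strongly prefers response 0
rewards = [0.0, 1.0, 2.0]        # reward model prefers response 2

# Moderate regularization: the base policy's preference still dominates.
tilted = kl_regularized_policy(base_probs, rewards, beta=1.0)

# Weak regularization: the reward dominates, which is also the regime
# where an imperfect reward model is easiest to exploit.
sharp = kl_regularized_policy(base_probs, rewards, beta=0.1)
```

With `beta=1.0` the highest-reward response remains the least likely one, a small illustration of base policy bias. With `beta=0.1` nearly all probability mass moves to it.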

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

SlimSearcher: Training Efficiency-Aware Web Agents via Adaptive Reward Gating

Researchers introduce SlimSearcher, a framework that trains AI web agents to perform complex information-seeking tasks with 17-58% fewer tool calls while maintaining or improving accuracy. The approach combines efficient trajectory filtering during supervised fine-tuning with adaptive reward gating during reinforcement learning to eliminate wasteful search behaviors.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Learning Process Rewards via Success Visitation Matching for Efficient RL

Researchers propose a novel reinforcement learning approach that converts sparse task rewards into dense process rewards by training a discriminator to identify successful episodes and incentivize policies to match their state-action visitations. The method demonstrates significantly faster training on robotic manipulation tasks without altering the optimal policy.
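
The summary does not describe how the paper guarantees an unchanged optimal policy, so the sketch below does not reproduce its discriminator-based method. It shows a different, classic technique with the same property: potential-based shaping, which adds gamma * phi(s') - phi(s) to each reward. The 1-D chain, the goal state, and the distance-based potential are all hypothetical choices for illustration.

```python
def potential(state, goal=4):
    # Hypothetical potential: negative distance to the goal on a 1-D chain.
    return -abs(goal - state)

def shaped_rewards(states, rewards, gamma):
    """Add gamma * phi(s') - phi(s) to each reward along one trajectory."""
    return [
        r + gamma * potential(s_next) - potential(s)
        for s, s_next, r in zip(states, states[1:], rewards)
    ]

def discounted_return(rewards, gamma):
    return sum(gamma**t * r for t, r in enumerate(rewards))

gamma = 0.9
states = [0, 1, 2, 3, 4]      # the agent walks straight to the goal
sparse = [0.0, 0.0, 0.0, 1.0] # reward only on reaching the goal
dense = shaped_rewards(states, sparse, gamma)

# The shaping terms telescope: the two returns differ by
# gamma**T * phi(s_T) - phi(s_0), which depends only on the endpoints.
gap = discounted_return(dense, gamma) - discounted_return(sparse, gamma)
```

Every step now carries a nonzero learning signal, while the gap between shaped and sparse returns depends only on the start and end states. That endpoint-only dependence is why this family of shaping terms leaves the ranking of policies unchanged.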

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

When LLM Reward Design Fails: Diagnostic-Driven Refinement for Sparse Structured RL

Researchers demonstrate that LLM-generated reward functions for reinforcement learning tasks fail in predictable ways, and that reward design is better treated as iterative debugging than as one-shot generation. Using diagnostic-driven refinement guided by a failure-mode taxonomy, they raise task success rates substantially (DoorKey-8x8: 2.3% to 97.6%), though the method shows limitations in dense-reward continuous control and requires reliable semantic interfaces.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

PIRS: Physics-Informed Reward Shaping for SAC-Based Building Energy Management

Researchers introduce PIRS (Physics-Informed Reward Shaping), a method that improves deep reinforcement learning controllers for building energy management by replacing ad-hoc comfort metrics with the Predicted Mean Vote (PMV) index defined in ISO 7730. Tested on CityLearn v2.1.2, PIRS performs competitively against manual baselines while substantially outperforming non-physics-grounded approaches on load ramping and peak demand metrics.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

OracleTSC: Oracle-Informed Reward Hurdle and Uncertainty Regularization for Traffic Signal Control

Researchers introduce OracleTSC, an LLM-based traffic signal control system that combines a reward hurdle mechanism with uncertainty regularization to stabilize reinforcement learning training. The approach achieves a 75% reduction in travel time while maintaining interpretability through natural language explanations, and it generalizes well across intersections.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

Researchers introduce PiCA (Pivot-Based Credit Assignment), a novel reinforcement learning mechanism that improves how LLM-based search agents learn from long sequences of actions. By identifying key pivot steps and anchoring rewards to final task outcomes, PiCA addresses critical challenges in credit assignment, delivering 15.2% performance gains on knowledge-intensive QA tasks.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Owen-Shapley Policy Optimization: A Principled RL Algorithm for Generative Search LLMs

Researchers introduce Owen-Shapley Policy Optimization (OSPO), a reinforcement learning algorithm that improves how language models learn from feedback by attributing credit to individual tokens rather than treating entire sequences as atomic units. The method addresses a fundamental training gap in generative AI systems used for recommendation tasks, showing measurable improvements on real e-commerce datasets.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping

Researchers introduce MEDS, a memory-enhanced reward shaping framework that addresses a critical reinforcement learning failure mode where language models repeatedly generate similar errors. By tracking historical behavioral patterns and penalizing recurring mistake clusters, the method achieves consistent performance improvements across multiple datasets and models while increasing sampling diversity.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

MVR: Multi-view Video Reward Shaping for Reinforcement Learning

Researchers introduce Multi-View Video Reward Shaping (MVR), a new reinforcement learning framework that uses multi-viewpoint video analysis and vision-language models to improve reward design for complex AI tasks. The system addresses limitations of single-image approaches by analyzing dynamic motions across multiple camera angles, showing improved performance on humanoid locomotion and manipulation tasks.