🧠 AI | ⚪ Neutral | Importance 6/10

RARM: Confidence-Gated Progress Reward Modeling for RL in Manipulation

arXiv – CS AI|Pengzhi Yang, Xinyu Wang, Pengyu Jing, Kehan Wen, Yiduo Qu, Zhenhao Huang, Minghao Fu, Xin Liu, Yaheng Shen, Fan Shi|June 23, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce RARM (Reference-Anchored Reward Model), a vision-based reward model that targets a major bottleneck in robot learning: it converts a single successful demonstration into a dense reward signal without task-specific engineering. The approach uses confidence-gated progress matching to avoid false-positive rewards, and the authors report strong performance across 13 simulated and real-world manipulation tasks.

Analysis

The paper addresses a fundamental challenge in reinforcement learning for robotics: designing effective reward functions that guide learning without excessive manual engineering or task-specific data. Traditional approaches rely on either sparse rewards, which provide minimal learning signal, or hand-crafted dense rewards, which require significant expertise and do not transfer across tasks. RARM moves toward more generalizable robot learning: it is trained once on generic video data and adapts at deployment through reference-based visual comparison against a successful demonstration.

This work builds on growing recognition that progress-based rewards offer better inductive biases for long-horizon tasks than traditional success/failure signals. Prior systems often assigned high rewards to visually similar but physically invalid states, a critical failure mode in robotic manipulation, where appearance alone can be misleading. RARM's confidence-gating mechanism addresses this directly by rewarding progress only when the visual matcher is sufficiently certain, which reduces the spurious rewards that derail learning.
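The gating idea can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the cosine-plus-softmax matcher, the temperature, and the 0.6 threshold are all assumptions chosen for illustration. Observations and reference frames are assumed to be already encoded as feature vectors. The sketch matches an observation against the frames of one reference demonstration, reads progress from the best-matching frame, and pays a reward only when the match is confident.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def match_progress(
    obs: Vector, reference: Sequence[Vector], temperature: float = 0.1
) -> Tuple[float, float]:
    """Return (progress, confidence) for one observation.

    `reference` holds the frames of a single successful demonstration in
    temporal order and must contain at least two frames. Progress is the
    normalized index of the best-matching frame; confidence is the softmax
    probability of that frame (a hypothetical confidence measure).
    """
    sims = [cosine(obs, frame) for frame in reference]
    peak = max(sims)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(weights)
    probs = [w / total for w in weights]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best / (len(reference) - 1), probs[best]


def gated_reward(
    prev_progress: float,
    obs: Vector,
    reference: Sequence[Vector],
    threshold: float = 0.6,
) -> Tuple[float, float]:
    """Return (reward, updated progress).

    The reward is the change in progress, paid only when the matcher's
    confidence reaches `threshold`; otherwise the reward is zero and the
    stored progress is left unchanged.
    """
    progress, confidence = match_progress(obs, reference)
    if confidence < threshold:
        return 0.0, prev_progress
    return progress - prev_progress, progress
```

With a three-frame reference of orthogonal vectors, an observation identical to the middle frame gives a reward of 0.5 from a starting progress of 0. An observation equally similar to all three frames has a confidence of 1/3, falls below the threshold, and earns no reward. That second case stands in for the visually ambiguous states the gating is meant to filter out.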

The results across 13 tasks (9 simulated, 4 real-world) demonstrate practical value, particularly for complex, long-horizon manipulation such as cloth folding, where progress estimation errors compound. Success on real-world tasks suggests the approach generalizes beyond simulation, though four tasks is a small sample. The method's lightweight architecture and independence from robot-specific data make it more accessible to researchers with varied experimental setups.

Future development should explore whether RARM's approach scales to multi-object scenes, dynamic environments, or tasks requiring subtle dexterity. The reliance on a reference demonstration, while minimal compared to alternatives, still means the task must be completed successfully once before learning begins, which is a practical limitation for novel scenarios.

Key Takeaways

→RARM enables dense reward generation from a single successful demonstration without task-specific labels or robot data
→Confidence-gated matching reduces the false-positive rewards that plague existing progress-based reward models
→The system shows its strongest gains on long-horizon tasks, where sparse rewards and progress estimation errors are most harmful
→One-time pre-training on generic videos enables adaptation to new tasks at deployment without retraining
→Real-world validation across four manipulation tasks indicates practical viability beyond simulation environments