AIBullish · arXiv — CS AI · 4h ago
Real-Time Aligned Reward Model beyond Semantics
Researchers introduce R2M (Real-Time Aligned Reward Model), a new framework for Reinforcement Learning from Human Feedback (RLHF) that addresses reward overoptimization in large language models. The system uses real-time policy feedback to better align reward models with evolving policy distributions during training.
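The paper's exact algorithm is not detailed in this summary, but the core idea of periodically refitting a reward model on samples from the *current* policy (rather than a stale offline preference set) can be illustrated with a toy sketch. Everything below (the linear reward model, the drifting policy, the `true_preference` stand-in for human feedback) is an assumption for illustration, not R2M's actual method:

```python
import random

class RewardModel:
    """Toy linear reward model over 1-D features, fit by least squares."""
    def __init__(self):
        self.w = 0.0

    def score(self, x):
        return self.w * x

    def fit(self, xs, ys):
        # Ordinary least squares for a single weight: w = <x, y> / <x, x>.
        num = sum(x * y for x, y in zip(xs, ys))
        den = sum(x * x for x in xs) or 1.0
        self.w = num / den

def true_preference(x):
    # Hypothetical stand-in for human feedback on a sample.
    return 2.0 * x

def train_with_realtime_feedback(steps=50, refit_every=10, seed=0):
    rng = random.Random(seed)
    rm = RewardModel()
    policy_mean = 0.0  # the policy's sampling distribution drifts as it trains
    for t in range(steps):
        # Samples come from the current (shifting) policy distribution.
        xs = [rng.gauss(policy_mean, 1.0) for _ in range(32)]
        if t % refit_every == 0:
            # "Real-time" alignment: refit the reward model on fresh
            # policy samples so it tracks the evolving distribution,
            # instead of scoring them with a model trained offline.
            rm.fit(xs, [true_preference(x) for x in xs])
        # Crude policy update: move toward the highest-scoring sample.
        best = max(xs, key=rm.score)
        policy_mean += 0.1 * (best - policy_mean)
    return rm.w

final_w = train_with_realtime_feedback()
```

Because the reward model is refit on in-distribution samples at every refresh, it keeps matching the feedback signal even as the policy drifts into new regions, which is the failure mode (reward overoptimization on stale data) the blurb says R2M targets.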