AI · Neutral · arXiv – CS AI · 15h ago · 6/10
StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning
StepOPSD is a reinforcement learning framework that improves credit assignment in multi-turn agent tasks by treating individual steps, rather than entire trajectories, as the unit of learning. The method achieves state-of-the-art results on benchmarks such as ALFWorld and Search-QA, demonstrating that step-level preference distillation is particularly effective when trajectory rewards correlate poorly with the quality of individual decisions.