StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning
StepOPSD introduces a reinforcement learning framework that improves credit assignment in multi-turn agent tasks by treating individual steps, rather than entire trajectories, as the unit of learning. The method achieves state-of-the-art results on benchmarks such as ALFWorld and Search-QA, demonstrating that step-level preference distillation is particularly effective when trajectory rewards correlate poorly with the quality of individual decisions.
StepOPSD addresses a fundamental challenge in reinforcement learning for agents: the credit-assignment problem. Traditional approaches reward or penalize entire trajectories, even though success often depends on a handful of critical local decisions, so the training signal is misaligned with what actually caused the outcome. The researchers decompose agent trajectories into action-centered segments and rescore them with enriched context, converting token-level probability differences into step-specific advantage signals before policy updates. This granular approach performs strongly on navigation and question-answering benchmarks, with the largest gains on tasks where trajectory-level rewards correlate only weakly with local action importance. The authors also identify what they call a "two-knob law": smaller clipping thresholds stabilize training consistently, while the optimal mixing strength varies by task. These findings suggest that step-aware distillation changes how agents learn from trajectories, especially in complex reasoning scenarios. The technique connects improvements in online policy distillation with practical step-level supervision, addressing the limitation of treating heterogeneous agent interactions as monolithic sequences. The consistent improvements across multiple model scales and domains indicate that the gains are not specific to a single model or task.
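To make the idea of a step-specific advantage more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not give the actual formula, so the `Step` class, the function names, the per-token clipping, and the additive mixing rule are all hypothetical. The two parameters `clip_eps` and `mix` stand in for the two "knobs" (clipping threshold and mixing strength) mentioned above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One action-centered segment of a trajectory (hypothetical structure)."""
    # Log-probabilities of the step's tokens under the policy with its normal context.
    policy_logprobs: List[float]
    # Log-probabilities of the same tokens when rescored with enriched context.
    rescored_logprobs: List[float]


def clip(value: float, limit: float) -> float:
    """Clamp value to the interval [-limit, limit]."""
    return max(-limit, min(limit, value))


def step_advantages(
    steps: List[Step],
    trajectory_advantage: float,
    clip_eps: float = 0.2,  # knob 1 (assumed): clipping threshold
    mix: float = 0.5,       # knob 2 (assumed): mixing strength
) -> List[float]:
    """Return one advantage per step.

    Assumed rule: each step gets the shared trajectory-level advantage plus
    `mix` times the mean clipped token-level log-probability difference.
    """
    advantages = []
    for step in steps:
        diffs = [
            clip(rescored - policy, clip_eps)
            for policy, rescored in zip(step.policy_logprobs, step.rescored_logprobs)
        ]
        step_signal = sum(diffs) / len(diffs) if diffs else 0.0
        advantages.append(trajectory_advantage + mix * step_signal)
    return advantages


if __name__ == "__main__":
    trajectory = [
        Step(policy_logprobs=[-1.0, -2.0], rescored_logprobs=[-0.5, -1.0]),
        Step(policy_logprobs=[-0.3], rescored_logprobs=[-0.9]),
    ]
    print(step_advantages(trajectory, trajectory_advantage=1.0))
```

In this sketch, a step whose tokens become more likely under the enriched context is credited above the trajectory-level advantage, and a step whose tokens become less likely is credited below it. Setting `mix` to 0 recovers plain trajectory-level credit, and a smaller `clip_eps` limits how far any single token can move a step's signal.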
- StepOPSD decomposes trajectories into action-centered steps for more precise credit assignment in multi-turn agent learning.
- The method achieves first-place results on multiple ALFWorld and Search-QA benchmarks, including 79.1% on ALFWorld Heat and 95.0% on PickTwo.
- Step-aware distillation provides greatest benefits when trajectory-level rewards misalign with the local decisions determining downstream success.
- The framework's two-knob law shows that smaller clipping values stabilize learning broadly while optimal mixing strength remains task-dependent.
- Results span multiple model scales (Qwen3-1.7B and Qwen2.5-3B), suggesting the approach generalizes beyond single architectures.