Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation
Researchers introduce AdaGRPO, a reinforcement learning framework for generative recommendation that applies reward signals selectively rather than uniformly, addressing the problem of noisy reward models trained on biased data. The approach combines supervised learning with adaptive gating mechanisms and reports gains in both offline e-commerce recommendation metrics and production A/B tests.
The article addresses a fundamental challenge in applying reinforcement learning to recommendation systems: production reward models trained on exposure-biased historical data provide unreliable guidance for policy optimization. Traditional approaches apply reward signals uniformly across all samples, but the research reveals this amplifies noise and degrades performance on samples where the ranker cannot effectively distinguish relevant items from negatives.
AdaGRPO reframes reward-guided optimization as selective admission rather than uniform application. The framework anchors training in supervised learning while gating the RL objective through two diagnostic checks: policy-side difficulty and reward discriminability. Samples failing either check revert to pure supervision, preventing noisy gradient amplification. This hybrid approach reflects a broader trend in machine learning toward adaptive, sample-aware training strategies that recognize heterogeneous data quality.
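The gating idea can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the paper's definition: the `Sample` fields, the hit-rate test used for policy-side difficulty, the margin test used for reward discriminability, and all threshold and weight values are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass
class Sample:
    # All fields are hypothetical per-sample quantities, not names from the paper.
    sft_loss: float                    # supervised loss for this sample
    rl_loss: float                     # GRPO-style loss for this sample
    rollout_hits: Sequence[bool]       # whether each sampled rollout matched the target
    reward_positive: float             # reward model score for the ground-truth item
    reward_negatives: Sequence[float]  # reward model scores for sampled negatives


def policy_is_uncertain(sample: Sample, low: float = 0.1, high: float = 0.9) -> bool:
    """Assumed difficulty check: the policy neither always hits nor always misses."""
    hit_rate = mean(1.0 if hit else 0.0 for hit in sample.rollout_hits)
    return low <= hit_rate <= high


def reward_is_discriminative(sample: Sample, min_margin: float = 0.2) -> bool:
    """Assumed discriminability check: the positive outscores the average negative."""
    margin = sample.reward_positive - mean(sample.reward_negatives)
    return margin >= min_margin


def gated_loss(sample: Sample, rl_weight: float = 0.5) -> float:
    """Supervision is always applied; the RL term is admitted only if both checks pass."""
    admit = policy_is_uncertain(sample) and reward_is_discriminative(sample)
    if admit:
        return sample.sft_loss + rl_weight * sample.rl_loss
    return sample.sft_loss  # revert to pure supervision


# The reward model separates the positive from the negatives: RL term admitted.
clean = Sample(1.0, 0.4, [True, False, True, False], 0.9, [0.2, 0.3])
# The reward model scores the positive below the negatives: RL term dropped.
noisy = Sample(1.0, 0.4, [True, False, True, False], 0.5, [0.5, 0.6])
print(gated_loss(clean), gated_loss(noisy))
```

In this sketch the gate is a hard 0/1 decision per sample. A soft weight between 0 and 1 would be an equally plausible reading of "adaptive gating", and the summary does not say which form the paper uses.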
Experiments on large-scale e-commerce data show concrete improvements: the model reaches 12.18% HR@10 at its best checkpoint while keeping hallucinations below 0.22%, and it remains robust through the final stages of training. Production A/B testing confirmed statistically significant gains in click-through rate and dwell time, translating the research advance into measurable business impact. These results support the claim that addressing noise in reward signals directly improves real-world recommendation quality.
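For readers unfamiliar with the metrics, the sketch below shows how HR@K is conventionally computed: the fraction of test cases whose ground-truth item appears among the top K recommendations. The hallucination rate shown is an assumed definition (the share of generated item IDs that do not exist in the catalog), since the summary does not define the term. Function names and the toy data are hypothetical.

```python
from typing import Sequence, Set


def hit_rate_at_k(recommendations: Sequence[Sequence[str]],
                  targets: Sequence[str],
                  k: int = 10) -> float:
    """Fraction of cases whose target item is in the top-k recommended items."""
    hits = sum(1 for recs, target in zip(recommendations, targets)
               if target in recs[:k])
    return hits / len(targets)


def hallucination_rate(recommendations: Sequence[Sequence[str]],
                       catalog: Set[str]) -> float:
    """Assumed definition: share of generated item IDs that are not in the catalog."""
    generated = [item for recs in recommendations for item in recs]
    invalid = sum(1 for item in generated if item not in catalog)
    return invalid / len(generated)


# Toy data: two users, three generated items each.
recs = [["a", "b", "c"], ["d", "e", "zzz"]]
targets = ["b", "x"]
catalog = {"a", "b", "c", "d", "e", "x"}

print(hit_rate_at_k(recs, targets, k=2))   # user 1 hits, user 2 misses
print(hallucination_rate(recs, catalog))   # "zzz" is the only invalid ID
```

Neither function guards against empty inputs; a production evaluation script would need to handle that case.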
The work has implications for the broader RL-for-recommendation space, suggesting that principled frameworks for handling imperfect reward models outperform simpler fixed-weight combinations. As recommendation systems increasingly incorporate RL components, adaptive noise-handling mechanisms may become standard practice.
- AdaGRPO selectively applies reward signals based on per-sample diagnostics rather than uniform application, improving recommendation quality.
- The framework achieves 12.18% HR@10 on large-scale e-commerce data while constraining hallucinations below 0.22%.
- Production A/B tests demonstrate statistically significant improvements in click-through rate and dwell time metrics.
- Reward models trained on exposure-biased logs require adaptive gating to prevent noisy gradient amplification in RL training.
- The approach combines supervised learning as a stable anchor with selective RL optimization for enhanced robustness.