🧠 AI | 🟢 Bullish | Importance 6/10

Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation

arXiv – CS AI | Kewei Xu, Junbo Qi, Yanyan Zou, Pengfei Zhang, Xingzhi Yao, Shengjie Li | June 9, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce AdaGRPO, a reinforcement learning framework for generative recommendation that applies reward signals selectively rather than uniformly, addressing the problem of noisy reward models trained on biased data. The approach anchors training in supervised learning, gates the RL objective adaptively per sample, and reports gains in offline e-commerce recommendation metrics and in production A/B tests.

Analysis

The article addresses a fundamental challenge in applying reinforcement learning to recommendation systems: production reward models trained on exposure-biased historical data provide unreliable guidance for policy optimization. Traditional approaches apply reward signals uniformly across all samples, but the research finds that this amplifies noise and degrades performance on samples where the ranker cannot effectively distinguish relevant items from negatives.

AdaGRPO reframes reward-guided optimization as selective admission rather than uniform application. The framework anchors training in supervised learning while gating the RL objective through two diagnostic checks: policy-side difficulty and reward discriminability. Samples failing either check revert to pure supervision, preventing noisy gradient amplification. This hybrid approach reflects a broader trend in machine learning toward adaptive, sample-aware training strategies that recognize heterogeneous data quality.
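The gating logic described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the summary names the two checks but does not define them, so the proxies used here (a rollout success-rate band for policy-side difficulty, reward spread within a rollout group for discriminability), the thresholds, the `rl_weight` value, and all names such as `Sample`, `admits_rl`, and `gated_loss` are assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import Sequence


@dataclass
class Sample:
    sft_loss: float            # supervised loss on the logged target item
    rl_loss: float             # GRPO-style loss for this sample's rollout group
    group_hit_rate: float      # fraction of rollouts that hit the target (assumed difficulty proxy)
    rewards: Sequence[float]   # reward-model scores for the rollouts in the group


def admits_rl(sample: Sample,
              min_hit: float = 0.1,
              max_hit: float = 0.9,
              min_reward_std: float = 0.05) -> bool:
    """Return True only if the sample passes both (hypothetical) diagnostic checks."""
    # Check 1 (assumed proxy): the sample is neither trivially easy nor hopeless
    # for the current policy.
    difficulty_ok = min_hit <= sample.group_hit_rate <= max_hit
    # Check 2 (assumed proxy): the reward model separates the rollouts at all;
    # a near-constant reward carries little usable ranking signal.
    discriminable = pstdev(sample.rewards) >= min_reward_std
    return difficulty_ok and discriminable


def gated_loss(sample: Sample, rl_weight: float = 0.5) -> float:
    """Supervised loss is always applied; the RL term is added only when admitted."""
    if admits_rl(sample):
        return sample.sft_loss + rl_weight * sample.rl_loss
    # Failing either check reverts the sample to pure supervision.
    return sample.sft_loss


if __name__ == "__main__":
    informative = Sample(1.0, 0.4, 0.5, [0.1, 0.9, 0.5, 0.3])
    flat_reward = Sample(1.0, 0.4, 0.5, [0.5, 0.5, 0.5, 0.5])
    print(gated_loss(informative))  # supervised + weighted RL term
    print(gated_loss(flat_reward))  # supervised term only
```

The point of the sketch is the control flow: supervision is unconditional, while the RL term is an optional addition that a sample has to earn, which is what distinguishes selective admission from a fixed-weight mix of the two losses.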

The experimental validation on large-scale e-commerce data reports concrete improvements: 12.18% HR@10 at the best checkpoint with a hallucination rate below 0.22%, and the method remains robust at the final training stages. Production A/B testing confirmed statistically significant gains in click-through rate and dwell time, carrying the research result into measurable business impact. This supports the claim that addressing noise in reward signals directly improves real-world recommendation quality.
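For readers unfamiliar with the reported metrics, the sketch below shows how they are commonly computed. It assumes the usual definition of HR@K (the share of test cases whose ground-truth item appears in the top K recommendations) and assumes that "hallucination" means a generated item ID absent from the catalog. The summary does not spell out either evaluation protocol, so both definitions, the function names, and the toy data are illustrative only.

```python
from typing import Sequence, Set


def hit_rate_at_k(recommendations: Sequence[Sequence[str]],
                  targets: Sequence[str],
                  k: int = 10) -> float:
    """Fraction of cases whose target item appears in the top-k recommendations."""
    if len(recommendations) != len(targets):
        raise ValueError("recommendations and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(1 for recs, target in zip(recommendations, targets)
               if target in recs[:k])
    return hits / len(targets)


def hallucination_rate(generated: Sequence[Sequence[str]],
                       catalog: Set[str]) -> float:
    """Fraction of generated item IDs that do not exist in the catalog (assumed definition)."""
    total = sum(len(items) for items in generated)
    if total == 0:
        return 0.0
    invalid = sum(1 for items in generated for item in items
                  if item not in catalog)
    return invalid / total


if __name__ == "__main__":
    # Toy data, not taken from the paper.
    recs = [["a", "b"], ["c", "d"], ["e"], ["x", "y", "z"]]
    targets = ["b", "z", "e", "z"]
    print(hit_rate_at_k(recs, targets, k=2))
    print(hallucination_rate([["a", "q"], ["b"]], {"a", "b"}))
```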

The work has implications for the broader RL-for-recommendation space, suggesting that principled frameworks for handling imperfect reward models outperform simpler fixed-weight combinations. As recommendation systems increasingly incorporate RL components, adaptive noise-handling mechanisms may become standard practice.

Key Takeaways

→ AdaGRPO applies reward signals selectively, based on per-sample diagnostics, rather than uniformly, improving recommendation quality.
→ The framework achieves 12.18% HR@10 on large-scale e-commerce data while keeping hallucinations below 0.22%.
→ Production A/B tests demonstrate statistically significant improvements in click-through rate and dwell time.
→ Reward models trained on exposure-biased logs require adaptive gating to prevent noisy gradient amplification in RL training.
→ The approach uses supervised learning as a stable anchor and adds RL optimization selectively for robustness.