
AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

arXiv – CS AI | Yang Xu, Kun Yao, Yiming Deng, Zheng Fang, Kai Ming Ting, Ming Pang
AI Summary

Researchers introduce Asymmetric Group Policy Optimization (AGPO), a reinforcement learning method that improves LLM reasoning by preventing capability collapse while focusing on rare correct solutions. The technique demonstrates state-of-the-art performance on mathematical benchmarks and has been deployed in JD's search ads relevance system, showing practical industrial applications.

Analysis

AGPO addresses a critical limitation in current reinforcement learning approaches for large language models: while existing methods improve sampling efficiency toward correct answers, they inadvertently narrow the reasoning capability boundary compared to base models. This phenomenon represents a fundamental trade-off between optimization and exploration that has constrained real-world LLM deployment. The proposed solution employs an asymmetric strategy—suppressing incorrect reasoning paths through negative-dominant reinforcement while using a group advantage mechanism to identify and amplify rare correct reasoning patterns based on intra-group variance.
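The asymmetric strategy described above can be sketched as a per-group advantage computation. This is a minimal illustration, not the paper's actual formulation: the function name, the fixed negative penalty, and the variance-based scaling of correct samples are all assumptions chosen to mirror the description (incorrect rollouts are uniformly suppressed; correct rollouts get larger credit when they are rare within the group).

```python
import numpy as np

def asymmetric_group_advantage(rewards, neg_weight=1.0, pos_scale=1.0):
    """Hypothetical sketch of an asymmetric group advantage.

    rewards: verifiable rewards (1.0 = correct, 0.0 = incorrect) for one
    group of rollouts sampled from the same prompt.
    Incorrect rollouts receive a flat negative advantage (negative-dominant
    suppression); correct rollouts are scaled up when they are rare, using
    the intra-group mean and standard deviation as the rarity signal.
    """
    r = np.asarray(rewards, dtype=float)
    mean, std = r.mean(), r.std()
    adv = np.zeros_like(r)
    # Negative-dominant term: push down every incorrect reasoning path.
    adv[r == 0.0] = -neg_weight
    # Group-advantage term: a low group mean with nonzero variance marks a
    # rare correct solution, which gets amplified credit.
    if std > 0:
        adv[r == 1.0] = pos_scale * (1.0 - mean) * std  # assumed scaling rule
    return adv
```

Under this sketch, a group where only one of four rollouts is correct assigns that rollout a larger positive advantage than a group where three of four are correct, which is the "amplify rare correct patterns" behavior the paper describes.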

The research builds on growing recognition that LLM reasoning requires sophisticated reward mechanisms beyond simple correctness signals. Prior work in reinforcement learning with verifiable rewards established that models could improve accuracy, but these improvements came at the cost of reducing the diversity and generality of reasoning approaches. AGPO's innovation lies in decoupling the optimization process, treating positive and negative feedback asymmetrically to preserve exploration capacity.

The industrial deployment at JD in search ads relevance demonstrates tangible business value beyond academic benchmarks. By enhancing data annotation quality, AGPO improves downstream student model performance in a production environment handling large-scale commercial decisions. This application validates the method's robustness and practical utility. The consistent improvements in pass@k metrics across mathematical benchmarks suggest AGPO could benefit other domains requiring reliable multi-step reasoning. Future work should investigate how this approach scales to larger models and more complex reasoning tasks beyond mathematics.

Key Takeaways
  • AGPO prevents the capability collapse observed in existing reinforcement learning methods by maintaining base model exploration capacity.
  • The asymmetric optimization strategy separates negative and positive reinforcement, with group advantage scaling based on intra-group variance.
  • State-of-the-art results achieved on five mathematical benchmarks with improved pass@k performance at scale.
  • Real-world deployment at JD improved search ads relevance annotation quality and downstream model performance.
  • The approach demonstrates how sophisticated reward mechanisms can preserve reasoning diversity while improving accuracy.