
AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD

arXiv – CS AI | Yang Xu, Kun Yao, Yiming Deng, Zheng Fang, Kai Ming Ting, Ming Pang
AI Summary

Researchers introduce Asymmetric Group Policy Optimization (AGPO), a reinforcement learning method that improves LLM reasoning by preventing capability collapse while focusing on rare correct solutions. The technique demonstrates state-of-the-art performance on mathematical benchmarks and has been deployed in JD's search ads relevance system, showing practical industrial applications.

Analysis

AGPO addresses a critical limitation in current reinforcement learning approaches for large language models: while existing methods improve sampling efficiency toward correct answers, they inadvertently narrow the reasoning capability boundary compared to base models. This phenomenon represents a fundamental trade-off between optimization and exploration that has constrained real-world LLM deployment. The proposed solution employs an asymmetric strategy—suppressing incorrect reasoning paths through negative-dominant reinforcement while using a group advantage mechanism to identify and amplify rare correct reasoning patterns based on intra-group variance.
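The asymmetric strategy described above can be sketched as a per-group advantage computation. This is a minimal illustration, not the paper's actual formulation: the function name, the fixed negative penalty, and the variance-based scaling of correct samples are all assumptions chosen to mirror the description (incorrect rollouts are uniformly suppressed; correct rollouts get larger credit when they are rare within the group).

```python
import numpy as np

def asymmetric_group_advantage(rewards, neg_weight=1.0, pos_scale=1.0):
    """Hypothetical sketch of an asymmetric group advantage.

    rewards: verifiable rewards (1.0 = correct, 0.0 = incorrect) for one
    group of rollouts sampled from the same prompt.
    Incorrect rollouts receive a flat negative advantage (negative-dominant
    suppression); correct rollouts are scaled up when they are rare, using
    the intra-group mean and standard deviation as the rarity signal.
    """
    r = np.asarray(rewards, dtype=float)
    mean, std = r.mean(), r.std()
    adv = np.zeros_like(r)
    # Negative-dominant term: push down every incorrect reasoning path.
    adv[r == 0.0] = -neg_weight
    # Group-advantage term: a low group mean with nonzero variance marks a
    # rare correct solution, which gets amplified credit.
    if std > 0:
        adv[r == 1.0] = pos_scale * (1.0 - mean) * std  # assumed scaling rule
    return adv
```

Under this sketch, a group where only one of four rollouts is correct assigns that rollout a larger positive advantage than a group where three of four are correct, which is the "amplify rare correct patterns" behavior the paper describes.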

The research builds on growing recognition that LLM reasoning requires sophisticated reward mechanisms beyond simple correctness signals. Prior work in reinforcement learning with verifiable rewards established that models could improve accuracy, but these improvements came at the cost of reducing the diversity and generality of reasoning approaches. AGPO's innovation lies in decoupling the optimization process, treating positive and negative feedback asymmetrically to preserve exploration capacity.

The industrial deployment at JD in search ads relevance demonstrates tangible business value beyond academic benchmarks. By enhancing data annotation quality, AGPO improves downstream student model performance in a production environment handling large-scale commercial decisions. This application validates the method's robustness and practical utility. The consistent improvements in pass@k metrics across mathematical benchmarks suggest AGPO could benefit other domains requiring reliable multi-step reasoning. Future work should investigate how this approach scales to larger models and more complex reasoning tasks beyond mathematics.

Key Takeaways
  • AGPO prevents the capability collapse observed in existing reinforcement learning methods by maintaining base model exploration capacity.
  • The asymmetric optimization strategy separates negative and positive reinforcement, with group advantage scaling based on intra-group variance.
  • State-of-the-art results achieved on five mathematical benchmarks with improved pass@k performance at scale.
  • Real-world deployment at JD improved search ads relevance annotation quality and downstream model performance.
  • The approach demonstrates how sophisticated reward mechanisms can preserve reasoning diversity while improving accuracy.