AIBullisharXiv – CS AI · 6h ago7/10
🧠
AGPO: Asymmetric Group Policy Optimization for Verifiable Reasoning and Search Ads Relevance at JD
Researchers introduce Asymmetric Group Policy Optimization (AGPO), a reinforcement learning method that improves LLM reasoning by preventing capability collapse while focusing on rare correct solutions. The technique demonstrates state-of-the-art performance on mathematical benchmarks and has been deployed in JD's search ads relevance system, showing practical industrial applications.