y0news
#capo
1 article
AI · Bullish · arXiv – CS AI · 5d ago · 7/10 · 3

Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning

Researchers have developed Curvature-Aware Policy Optimization (CAPO), an algorithm that stabilizes policy-gradient training for Large Language Model reasoning and improves sample efficiency by up to 30x. The method uses curvature information about the optimization landscape to identify and filter training samples that would destabilize updates, and it intervenes on fewer than 8% of tokens.