Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning
Researchers have developed Curvature-Aware Policy Optimization (CAPO), an algorithm that stabilizes policy-gradient training of Large Language Models on reasoning tasks and improves sample efficiency by up to 30x. The method uses curvature information about the optimization landscape to identify training samples likely to cause unstable updates and filters them out, intervening on fewer than 8% of tokens.
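
To make the filtering idea concrete, here is a minimal Python sketch of token-level masking in a policy-gradient loss. It is an illustration under stated assumptions, not CAPO's actual procedure: the per-token `instability_scores`, the `threshold`, and the function name `masked_policy_gradient_loss` are all hypothetical stand-ins, since the summary does not specify how problematic samples are scored. The loss is a plain REINFORCE-style surrogate chosen for simplicity.

```python
from typing import List, Tuple


def masked_policy_gradient_loss(
    log_probs: List[float],
    advantages: List[float],
    instability_scores: List[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (loss, rejected_fraction) for one batch of tokens.

    `instability_scores` is a hypothetical per-token stand-in for a
    curvature-based criterion; it is NOT the rule used by CAPO itself.
    """
    if not (len(log_probs) == len(advantages) == len(instability_scores)):
        raise ValueError("all inputs must have the same length")

    # Keep a token only if its score is at or below the threshold.
    mask = [1.0 if s <= threshold else 0.0 for s in instability_scores]
    kept = sum(mask)
    rejected_fraction = (1.0 - kept / len(mask)) if mask else 0.0

    # If every token was rejected, there is nothing to optimize on.
    if kept == 0:
        return 0.0, rejected_fraction

    # REINFORCE-style surrogate: -(advantage * log-prob), averaged over kept tokens.
    total = sum(-a * lp * m for lp, a, m in zip(log_probs, advantages, mask))
    return total / kept, rejected_fraction


# Example: the third token has a high (hypothetical) score and is masked out.
loss, rejected = masked_policy_gradient_loss(
    log_probs=[-1.0, -2.0, -0.5, -1.5],
    advantages=[1.0, -1.0, 2.0, 0.5],
    instability_scores=[0.1, 0.2, 5.0, 0.3],
    threshold=1.0,
)
```

Averaging over kept tokens only, rather than over all tokens, keeps the loss scale comparable when the rejected fraction varies between batches. Returning the rejected fraction makes it easy to monitor how much intervention the filter performs.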