AI · Bullish · arXiv – CS AI · 9h ago · 6/10
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
Researchers propose CTPO (Cumulative Token Policy Optimization), a reinforcement learning method for large language models that targets the bias-variance tradeoff in importance sampling ratios. CTPO uses cumulative token-level ratios with position-adaptive clipping, and the authors report that it outperforms existing methods such as PPO and GRPO on mathematical reasoning benchmarks.
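
To make the idea of cumulative token-level ratios with position-dependent clipping concrete, here is a minimal Python sketch. It is an illustration only, not the paper's implementation: the summary does not give CTPO's formulas, so the prefix-product definition of the ratio, the linear clipping schedule, and the names `cumulative_ratios`, `position_adaptive_clip`, `base_eps`, and `growth` are all assumptions made for this example.

```python
import math

def cumulative_ratios(logp_new, logp_old):
    """Prefix-product importance ratios, computed in log space.

    Assumed form: ratio_t = prod_{i<=t} pi_new(y_i) / pi_old(y_i).
    """
    ratios, running = [], 0.0
    for new, old in zip(logp_new, logp_old):
        running += new - old  # cumulative log-ratio up to this token
        ratios.append(math.exp(running))
    return ratios

def position_adaptive_clip(ratios, base_eps=0.2, growth=0.05):
    """Clip each cumulative ratio to [1 - eps_t, 1 + eps_t].

    Hypothetical schedule: eps_t = base_eps + growth * t, so the clipping
    range widens with token position t. The direction and shape of the
    schedule are assumptions, not taken from the paper.
    """
    clipped = []
    for t, r in enumerate(ratios):
        eps = base_eps + growth * t
        lo, hi = max(0.0, 1.0 - eps), 1.0 + eps
        clipped.append(min(max(r, lo), hi))
    return clipped

# Toy per-token log-probabilities under the old and new policies.
logp_old = [-1.0, -2.0, -0.5]
logp_new = [-0.9, -1.5, -0.5]

ratios = cumulative_ratios(logp_new, logp_old)
clipped = position_adaptive_clip(ratios)
print(ratios)
print(clipped)
```

In this toy run the cumulative ratio grows once the second token's probability rises under the new policy, and the later positions are capped by their wider (assumed) clipping bounds. A per-token ratio, as in standard PPO-style clipping, would instead treat each position independently of the prefix.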