Online Causal Kalman Filtering for Stable and Effective Policy Optimization
Researchers propose Online Causal Kalman Filtering for Policy Optimization (KPO) to address high-variance instability in reinforcement learning for large language models. The method uses Kalman filtering to smooth token-level importance sampling ratios, preventing training collapse and achieving superior results on math reasoning tasks.