Rethinking the Trust Region in LLM Reinforcement Learning
Researchers propose Divergence Proximal Policy Optimization (DPPO), a replacement for PPO's ratio clipping mechanism that better handles the large vocabularies in LLM fine-tuning. The new approach uses direct policy divergence estimates instead of noisy token probability ratios, offering improved training stability and efficiency.