AI · Bullish · arXiv – CS AI · 8h ago · 7/10
DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment
Researchers introduce Distribution Guided Policy Optimization (DGPO), a reinforcement learning framework that improves how large language models learn complex reasoning tasks by assigning credit at the token level rather than the sequence level. DGPO replaces unstable KL divergence penalties with a bounded Hellinger distance and adds an entropy gating mechanism. The authors report state-of-the-art performance on challenging math benchmarks such as AIME2024 and AIME2025.
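To illustrate why a bounded distance can be more stable than a KL penalty, here is a minimal sketch comparing the two on toy two-outcome distributions. It is not DGPO's implementation: the function names `hellinger` and `kl` and the example distributions are assumptions chosen for illustration, and the summary does not say how the paper applies the distance per token.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions; always in [0, 1]."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)

def kl(p, q):
    """KL(p || q); unbounded as q puts vanishing mass where p does not."""
    total = 0.0
    for a, b in zip(p, q):
        if a > 0:
            if b == 0:
                return math.inf
            total += a * math.log(a / b)
    return total

# Hypothetical toy distributions: q drifts toward a near-deterministic policy.
p = [0.5, 0.5]
for eps in (1e-2, 1e-6, 1e-12):
    q = [1.0 - eps, eps]
    print(f"eps={eps:g}  hellinger={hellinger(p, q):.4f}  kl={kl(p, q):.4f}")
```

As `eps` shrinks, the KL value keeps growing while the Hellinger distance stays below 1, which is the boundedness property the summary refers to.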