DIVA-GRPO: Enhancing Multimodal Reasoning through Difficulty-Adaptive Variant Advantage
Researchers have developed DIVA-GRPO, a new reinforcement learning method that improves multimodal large language model reasoning by adaptively adjusting problem difficulty distributions. The approach addresses key limitations in existing group relative policy optimization methods, showing superior performance across six reasoning benchmarks.