AI · Bullish · arXiv – CS AI · 15h ago · 6/10
Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation
Researchers propose Tournament-GRPO, a reinforcement learning framework for open-ended long-form text generation that replaces absolute scoring with group-wise tournament comparisons. It converts rubric-based LLM judgments into relative rewards through competitive rankings, and the reported result is a 4.52-point improvement over existing approaches on Deep Research Bench.
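
To make the idea of relative, group-wise rewards concrete, here is a minimal Python sketch. It is not the paper's implementation: the round-robin schedule, the win-rate reward, the function names `tournament_rewards` and `group_advantages`, and the toy `length_judge` (a stand-in for a rubric-based LLM judge) are all assumptions made for illustration. The mean/standard-deviation normalization follows the usual GRPO-style group baseline.

```python
from itertools import combinations
from statistics import mean, pstdev
from typing import Callable, List, Sequence


def tournament_rewards(
    responses: Sequence[str],
    judge: Callable[[str, str], float],
) -> List[float]:
    """Round-robin tournament over one sampled group (assumed schedule).

    `judge(a, b)` returns 1.0 if `a` wins, 0.0 if `b` wins, 0.5 for a tie.
    Each response's reward is its average score against every other response.
    """
    n = len(responses)
    if n < 2:
        return [0.0] * n
    scores = [0.0] * n
    for i, j in combinations(range(n), 2):
        s = judge(responses[i], responses[j])
        scores[i] += s
        scores[j] += 1.0 - s
    return [total / (n - 1) for total in scores]


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within the group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def length_judge(a: str, b: str) -> float:
    """Hypothetical toy judge: the longer text wins. A real setup would
    query an LLM with a rubric instead."""
    if len(a) == len(b):
        return 0.5
    return 1.0 if len(a) > len(b) else 0.0


if __name__ == "__main__":
    group = ["a", "bbb", "cc"]
    rewards = tournament_rewards(group, length_judge)
    print(rewards)
    print(group_advantages(rewards))
```

Because every reward is defined only relative to the other responses in the same group, the judge never has to produce a calibrated absolute score, which is the property the summary attributes to the tournament approach.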