AI · Bullish · arXiv – CS AI · 10h ago · 7/10
DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards
Researchers introduce DUET, a token-budget allocation method for reinforcement learning with verifiable rewards that jointly controls which prompts receive rollouts and how long each rollout runs. The authors report higher reasoning quality on math and coding benchmarks while using 50% fewer tokens than baseline methods, which suggests that efficiency gains need not come at the cost of model performance.
🧠 Llama