AINeutralarXiv – CS AI · 3h ago5/10
🧠
Where Rollouts Begin: Low-Load, High-Leverage First-Token Diversification for RLVR
Researchers introduce REFT, a method that improves Reinforcement Learning with Verifiable Rewards (RLVR) by diversifying the first token generated after reasoning markers, addressing a previously overlooked bottleneck in rollout diversity. The technique achieves measurable improvements across multiple model sizes and difficulty levels without requiring changes to existing RLVR pipelines.