Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models
Researchers introduce Selective-Advantage Adaptive-Horizon GRPO (SA-AH-GRPO), an improved reinforcement learning algorithm for language models that applies asymmetric token-level discounting to stabilize training on reasoning tasks. The method achieves a 3.6x reduction in training variance while maintaining peak performance on mathematical reasoning benchmarks, demonstrating more efficient model alignment without sacrificing accuracy.
This research addresses a fundamental inefficiency in Group Relative Policy Optimization (GRPO), a promising RL algorithm for aligning language models on complex reasoning tasks. Standard GRPO treats all token positions and rollouts equally, which wastes computational resources on failed trajectories and introduces unnecessary variance during training. The proposed SA-AH-GRPO method introduces entropy-adaptive horizon weighting that contracts the effective planning horizon when model uncertainty is high, while crucially applying this constraint only to negative-advantage (failed) trajectories. This asymmetric approach preserves full gradient signals on successful solutions, preventing the algorithm from over-correcting on what already works.
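To make the asymmetry concrete, the following is a minimal sketch of how such a weighting could be written. The summary does not give the paper's formulas, so everything specific here is an assumption for illustration: the function names, the linear mapping from per-token entropy to a discount factor, the `gamma_min` floor, and the choice to discount each token by the product of the discount factors of the tokens that follow it. Only the overall behavior follows the description above: full weight on non-negative-advantage rollouts, entropy-dependent discounting on negative-advantage ones.

```python
from math import sqrt


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantage: z-score each reward within its rollout group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (sqrt(var) + eps) for r in rewards]


def asymmetric_token_weights(entropies, advantage, gamma_min=0.9, max_entropy=None):
    """Per-token weights for one rollout (hypothetical form, not the paper's).

    Non-negative advantage: every token keeps weight 1.0 (full gradient signal).
    Negative advantage: token t is weighted by the product of the discount
    factors of all later tokens, where a token's discount factor falls from
    1.0 toward gamma_min as its entropy rises. High entropy therefore
    shortens the effective horizon.
    """
    n = len(entropies)
    if n == 0:
        return []
    if advantage >= 0:
        return [1.0] * n
    if max_entropy is None:
        max_entropy = max(max(entropies), 1e-8)

    weights = [1.0] * n
    running = 1.0  # product of discount factors for tokens after position t
    for t in range(n - 1, -1, -1):
        weights[t] = running
        uncertainty = min(max(entropies[t] / max_entropy, 0.0), 1.0)
        running *= 1.0 - (1.0 - gamma_min) * uncertainty
    return weights


def token_level_advantages(entropies, advantage, **kwargs):
    """Broadcast the rollout advantage to tokens, scaled by the asymmetric weights."""
    weights = asymmetric_token_weights(entropies, advantage, **kwargs)
    return [advantage * w for w in weights]
```

With a group of rewards such as `[1.0, 0.0, 1.0, 0.0]`, the successful rollouts keep an unmodified advantage at every token, while the failed rollouts have their negative advantage shrunk at early tokens, most strongly when later tokens were generated under high entropy. A real implementation would operate on batched tensors inside the policy-gradient loss rather than on Python lists.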
The work sits within the broader context of making language model training more sample-efficient and stable. As models scale and reasoning tasks grow more complex, variance reduction becomes critical for practical deployment. On Qwen 2.5-3B with mathematical reasoning, SA-AH-GRPO maintains 84.6% accuracy at 180 training steps while reducing variance to 0.0246, compared with 0.088 for baseline GRPO (0.088 / 0.0246 ≈ 3.6, the source of the 3.6x figure). Performance holds across model sizes, with the 1.5B variant improving 7.6 percentage points over the zero-shot baseline.
For the broader AI development community, this demonstrates how principled modifications to RL algorithms can yield significant practical gains. The insight that asymmetric discounting prevents entropy collapse while maintaining gradient fidelity on correct solutions has implications beyond reasoning tasks: it suggests a general principle for structured generation problems with verifiable rewards. Developers optimizing language model fine-tuning workflows could benefit from adopting these techniques, particularly in resource-constrained settings where variance reduction translates directly into faster convergence and lower computational costs.
- SA-AH-GRPO reduces training variance by 3.6x compared to standard GRPO while maintaining peak mathematical reasoning accuracy
- Asymmetric token-level discounting preserves full gradient signals on successful trajectories while dampening unhelpful updates from failures
- The method prevents entropy collapse and stabilizes training through entropy-adaptive horizon weighting on uncertain tokens
- Peak Pass@1 accuracy reaches 85.8% on the GSM8K benchmark for the 3B model with substantially reduced training instability
- The research suggests practical efficiency gains for language model fine-tuning in resource-constrained environments