Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO
Researchers propose S2L-PO, a framework that uses smaller language models as natural policy explorers to train larger models more efficiently. By leveraging the inherent policy-level diversity of smaller models rather than token-level randomness, the approach achieves significant accuracy improvements on mathematical reasoning tasks while reducing computational costs.
The research addresses a fundamental challenge in training large language models: generating diverse rollouts for effective policy optimization without introducing incoherent sampling noise. Traditional approaches to diversity in Group Relative Policy Optimization (GRPO) inject randomness at the token level, which can disrupt logical consistency across generated sequences. This work identifies that smaller models within the same family naturally produce more diverse yet logically coherent outputs, suggesting a more structured exploration strategy than random perturbation.
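To see why rollout diversity matters for GRPO, recall that GRPO scores each rollout relative to the other rollouts sampled for the same prompt, typically by subtracting the group's mean reward and dividing by its standard deviation. The sketch below illustrates that normalization only; the function name and the `eps` stabilizer are illustrative choices, not details taken from the S2L-PO work.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style).

    `eps` is an assumed stabilizer to avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group where every rollout earns the same reward gives all-zero advantages,
# so the policy receives no learning signal from that prompt.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# A group with mixed outcomes gives positive and negative advantages.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When a group collapses to near-identical rollouts, the advantages vanish, which is one reason a source of coherent diversity can help.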
The S2L-PO framework repurposes smaller models as fixed explorers to guide larger model training, in line with a broader trend in AI efficiency research toward leveraging model heterogeneity. Its progressive annealing mechanism, which transitions from small-model rollouts to the large model's own sampling, addresses a critical practical problem: avoiding performance degradation during the handover while keeping the exploration benefits. The reported 8.8% accuracy improvement on the AIME 24 benchmark, obtained with a 1.7B model guiding an 8B learner, indicates meaningful gains in mathematical reasoning.
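The annealing idea can be pictured as a schedule that decides, at each training step, how many rollouts in a group come from the small explorer and how many from the large learner. The article does not give the actual schedule, so the following is a minimal sketch that assumes a linear decay; `small_model_fraction`, `split_group`, and their parameters are hypothetical names for illustration.

```python
def small_model_fraction(step, total_steps, start=1.0, end=0.0):
    """Linearly decay the share of rollouts drawn from the small explorer.

    Assumption: a linear schedule; the real method may use a different one.
    """
    if total_steps <= 1:
        return end
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * progress

def split_group(group_size, step, total_steps):
    """Return (n_small, n_large): rollouts per prompt from each model."""
    n_small = round(group_size * small_model_fraction(step, total_steps))
    return n_small, group_size - n_small

# Start, midpoint, and end of a hypothetical 101-step run with groups of 8.
first = split_group(8, 0, 101)
middle = split_group(8, 50, 101)
last = split_group(8, 100, 101)
```

Under this assumed schedule, training begins with all rollouts from the small model and ends with all rollouts from the large model, with a gradual mix in between rather than an abrupt switch.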
For the AI industry, this research offers a way to cut training costs while improving model performance, which bears directly on the economics of large model development. The framework suggests that compute currently spent on diverse sampling can be partly redirected toward larger model capacity, improving the cost-performance frontier. The approach also has implications for scaling-law research and points to training paradigms beyond simply scaling compute. Future work may test the principle across other model families and tasks, potentially informing how organizations design multi-model training setups for the best cost-performance tradeoff.
- Smaller models exhibit greater policy-level diversity, which improves exploration in language model training without introducing incoherent noise
- The S2L-PO framework reduces rollout compute while achieving an 8.8% accuracy gain on the AIME 24 mathematical reasoning benchmark
- A progressive annealing strategy enables a smooth transition from small-model guidance to large-model self-sampling, avoiding mid-training performance drops
- Temporally correlated diversity from smaller models provides a more structured exploration signal than token-level randomness injection
- The framework points to alternative training paradigms that leverage model heterogeneity to improve computational efficiency in large language model development