INFUSER: Influence-Guided Self-Evolution Improves Reasoning
INFUSER is a novel self-evolution framework that enables language models to improve their reasoning capabilities through an iterative co-training process between a Generator and a Solver, using an influence-aware scoring mechanism rather than difficulty heuristics. The method achieves a 20% relative improvement on mathematical and coding benchmarks, demonstrating that adaptive curriculum learning can outperform larger frozen models.
INFUSER addresses a fundamental challenge in AI model development: scaling reasoning capabilities without relying on expensive human annotation or teacher models. The framework introduces an optimizer-aware influence score that evaluates whether generated questions meaningfully advance solver performance on target benchmarks, moving beyond simplistic difficulty-based selection. This distinction matters because hard questions don't necessarily improve model capabilities: a question helps only when it aligns with the learner's current weaknesses and development trajectory.
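To make the idea of an optimizer-aware influence score concrete, here is a minimal sketch, not the authors' implementation. It assumes a first-order approximation: the score of a question is the predicted drop in target-benchmark loss after one optimizer step on that question. The function names, the Adam-style scaling, and the toy gradients are all illustrative assumptions that do not come from the source.

```python
import numpy as np

def adam_style_update(grad, second_moment, lr=1e-3, eps=1e-8):
    """Hypothetical optimizer-aware step: scale the gradient the way Adam would."""
    return -lr * grad / (np.sqrt(second_moment) + eps)

def influence_score(question_grad, target_grad, second_moment, lr=1e-3):
    """First-order estimate of how much one step on the question lowers target loss.

    Taylor expansion: L_target(theta + delta) - L_target(theta) ~= target_grad . delta,
    so the predicted loss reduction is the negative of that dot product.
    """
    delta = adam_style_update(question_grad, second_moment, lr)
    return -float(np.dot(target_grad, delta))

# Toy gradients (assumed values, for illustration only).
target_grad = np.array([1.0, 0.0])
second_moment = np.ones(2)

aligned = influence_score(np.array([1.0, 0.0]), target_grad, second_moment)
opposed = influence_score(np.array([-1.0, 0.0]), target_grad, second_moment)
orthogonal = influence_score(np.array([0.0, 1.0]), target_grad, second_moment)
print(aligned > 0, opposed < 0, orthogonal == 0.0)
```

Under these assumptions, a question whose gradient points the same way as the target gradient gets a positive score, one that points the opposite way gets a negative score, and an unrelated question scores zero, regardless of how hard any of them is for the solver.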
The dual-normalized GRPO (DuGRPO) variant represents a technical contribution to reinforcement learning from continuous, noisy feedback signals. Traditional GRPO methods struggle with influence scores that lack the clean binary correctness signal available to the Solver. By implementing dual normalization, the framework stabilizes training for the Generator role, enabling it to reliably identify beneficial questions from unstructured document pools.
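The summary does not spell out what the two normalizations in DuGRPO are, so the sketch below should not be read as the paper's method. It contrasts the standard GRPO group-relative advantage (z-scoring rewards within each group of samples) with one plausible alternative for noisy continuous rewards: centering within each group but scaling by a batch-wide standard deviation. The second function and the toy reward values are assumptions for illustration.

```python
import numpy as np

def grpo_advantage(rewards, eps=1e-6):
    """Standard GRPO: z-score each reward within its own group (one row per prompt)."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)

def group_center_batch_scale(rewards, eps=1e-6):
    """Hypothetical variant: subtract the group mean, divide by the batch-wide std."""
    centered = rewards - rewards.mean(axis=1, keepdims=True)
    return centered / (centered.std() + eps)

# Toy continuous rewards: row 0 differs only by small noise, row 1 has real spread.
rewards = np.array([
    [0.50, 0.51, 0.49, 0.50],
    [0.10, 0.90, 0.30, 0.70],
])

print(np.round(grpo_advantage(rewards), 2))
print(np.round(group_center_batch_scale(rewards), 2))
```

In this toy batch, per-group z-scoring inflates the near-identical rewards in the first row to advantages as large as those in the second row, while the batch-scaled variant keeps them close to zero. That is one way noise amplification from continuous signals could be damped; whether DuGRPO does this is not stated in the source.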
The results carry significant implications for cost-efficient model training. An 8B parameter INFUSER generator outperforming a frozen 32B thinking-focused generator on mathematical tasks suggests that curriculum quality matters more than raw model scale. This finding could reshape training economics, particularly for organizations without access to massive computational resources or proprietary training data.
The framework's flexibility—demonstrated through extensions to instruction-finetuned models and integration with rule-verifiable reinforcement learning—indicates potential broader applicability beyond pure reasoning benchmarks. As organizations increasingly pursue in-house model development, adaptive self-evolution methods could become foundational infrastructure. The open-source release enables community validation and extension.
- INFUSER uses optimizer-aware influence scoring to generate adaptive curricula, outperforming difficulty-based question selection by 20% on reasoning benchmarks
- An 8B co-evolving INFUSER generator surpasses a frozen 32B thinking model, suggesting curriculum quality matters more than raw model size for reasoning improvement
- DuGRPO handles continuous, noisy influence signals more effectively than standard GRPO, enabling stable Generator training
- The framework demonstrates a cost-efficient self-evolution path that needs only unstructured document pools and minimal external supervision
- The open-source release and flexible architecture enable integration with instruction-finetuning and rule-verifiable reinforcement learning approaches