KnowRL: Boosting LLM Reasoning via Reinforcement Learning with Minimal-Sufficient Knowledge Guidance
Researchers introduce KnowRL, a reinforcement learning framework that improves large language model reasoning by using minimal, strategically selected knowledge points rather than verbose hints. The approach achieves state-of-the-art results on reasoning benchmarks at the 1.5B parameter scale, with the trained model and code made publicly available.
KnowRL addresses a fundamental challenge in reinforcement learning for language models: how to guide training without introducing computational bloat and inconsistency. Traditional hint-based RL methods improve performance by injecting partial solutions, but they scale inefficiently by adding excessive tokens that create redundancy and training overhead. This research reframes hint design as an optimization problem, decomposing guidance into atomic knowledge points and using Constrained Subset Search to identify the minimal set needed for effective training.
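The core idea of searching for a minimal sufficient hint can be sketched with a simple greedy backward-elimination loop. This is an illustrative approximation, not the paper's actual Constrained Subset Search procedure; the function names and the toy sufficiency predicate are hypothetical stand-ins for "RL training on this problem still succeeds with this hint subset."

```python
def minimal_sufficient_subset(points, is_sufficient):
    """Greedy backward elimination: start from the full set of atomic
    knowledge points and drop one at a time, keeping each removal only
    if the remaining subset still passes the sufficiency check.
    (Illustrative sketch, not the paper's exact algorithm.)"""
    subset = list(points)
    changed = True
    while changed:
        changed = False
        for p in list(subset):
            candidate = [q for q in subset if q != p]
            if is_sufficient(candidate):
                subset = candidate
                changed = True
    return subset

# Toy sufficiency predicate (hypothetical): a hint is useful only if it
# contains point "k1" plus at least one of the interchangeable points
# {"k2", "k3"}; "k4" is pure redundancy.
toy_sufficient = lambda s: "k1" in s and bool({"k2", "k3"} & set(s))

kept = minimal_sufficient_subset(["k1", "k2", "k3", "k4"], toy_sufficient)
# kept == ["k1", "k3"]: the redundant "k4" and one of the two
# interchangeable points are pruned away.
```

In practice the sufficiency check would be far more expensive (e.g. a training or rollout evaluation), which is why framing hint design as a constrained search problem, rather than exhaustively testing all subsets, matters.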
The framework tackles a nuanced technical challenge termed the 'pruning interaction paradox': removing any single knowledge point can improve performance, yet removing several at once degrades it. This reflects real-world dependencies in reasoning tasks, where certain knowledge combinations matter only in context. By optimizing for robust subset curation, KnowRL achieves notable empirical gains: the 1.5B parameter model reaches 70.08% average accuracy without hints at inference (a +9.63 point improvement), and 74.16% with selected hints.
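The paradox above can be made concrete: two knowledge points may each be individually redundant (removing either alone keeps the hint sufficient) while being jointly necessary (removing both breaks it). The sketch below detects such pairs; it is a hypothetical illustration using a toy sufficiency predicate, not the paper's implementation.

```python
from itertools import combinations

def pruning_interaction_pairs(points, is_sufficient):
    """Find pairs of knowledge points that each look safely removable in
    isolation but cannot be removed together — the 'pruning interaction
    paradox'. (Illustrative sketch with an exhaustive pairwise check.)"""
    # Points whose individual removal keeps the hint sufficient.
    singles = {p for p in points
               if is_sufficient([q for q in points if q != p])}
    paradox_pairs = []
    for a, b in combinations(sorted(singles), 2):
        remaining = [q for q in points if q not in (a, b)]
        if not is_sufficient(remaining):
            paradox_pairs.append((a, b))
    return paradox_pairs

# Same toy predicate as before (hypothetical): "k1" is mandatory,
# and at least one of {"k2", "k3"} must survive.
toy_sufficient = lambda s: "k1" in s and bool({"k2", "k3"} & set(s))

pairs = pruning_interaction_pairs(["k1", "k2", "k3", "k4"], toy_sufficient)
# pairs == [("k2", "k3")]: either point can go alone, but not both.
```

A subset-selection procedure that tests removals only one point at a time would miss exactly these dependencies, which is why interaction-aware curation is needed.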
For the AI development community, this work demonstrates that efficiency and effectiveness in model training aren't mutually exclusive. Open-sourcing the model, training data, and code accelerates reproducibility and adoption. The methodology applies beyond mathematical reasoning, potentially benefiting other domains that require step-by-step problem solving. This represents iterative progress in making reasoning capabilities more accessible at smaller model scales, relevant as organizations seek performant models with lower computational costs.
- →KnowRL uses minimal, interaction-aware knowledge points instead of verbose hints to guide RL training more efficiently
- →The framework identifies and optimizes for the 'pruning interaction paradox' where knowledge point dependencies affect training outcomes
- →KnowRL-Nemotron-1.5B achieves 70.08% average accuracy without hints and 74.16% with selected hints, establishing new state-of-the-art for this model scale
- →Open-source release of model, training data, and code enables reproducibility and broader adoption of the approach
- →The research demonstrates efficiency gains in RL training by reducing token overhead and eliminating redundancy in guidance