Researchers introduce RoRo, a novel framework for stepwise model routing in Large Reasoning Models that uses process-based rewards rather than outcome-only rewards to evaluate intermediate routing decisions. The approach combines rubric-guided evaluation with reinforcement learning to improve efficiency and accuracy across multiple reasoning benchmarks.
RoRo addresses a fundamental limitation in how Large Reasoning Models allocate computational resources across reasoning steps. Traditional routing systems rely exclusively on outcome rewards (whether the final answer is correct), which give too little signal for optimizing intermediate decisions. This creates a training blind spot: the model learns that a final state was desirable, but not which routing choices produced it. The framework tackles this by introducing process rewards that evaluate the quality of the routing decisions themselves, not just their endpoints.
The technical innovation centers on two components: a Rubricor that generates task-specific evaluation criteria and a Judge that scores routing trajectories against these criteria. This two-stage approach lets the system assess what makes a good intermediate decision in context, rather than applying generic quality metrics. By combining process rewards with outcome rewards under Group Relative Policy Optimization (GRPO), RoRo rewards both how a trajectory routes each step and whether it arrives at a correct answer.
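To make the reward combination concrete, here is a minimal Python sketch of how a process score and an outcome score might be blended and then turned into GRPO-style group-relative advantages. It is an illustration under stated assumptions, not the authors' implementation: the additive blend, the `weight` parameter, the function names, and the example scores are all hypothetical, since the summary does not say how RoRo combines the two signals.

```python
from statistics import mean, pstdev


def combined_reward(outcome: float, process: float, weight: float = 0.5) -> float:
    """Blend an outcome reward with a process reward.

    The additive form and the default weight are assumptions for illustration.
    """
    return outcome + weight * process


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical routing trajectories sampled for the same question:
# (outcome reward: 1.0 if the final answer is correct, Judge rubric score in [0, 1])
group = [(1.0, 0.9), (1.0, 0.4), (0.0, 0.6), (0.0, 0.1)]

rewards = [combined_reward(outcome, process) for outcome, process in group]
advantages = group_relative_advantages(rewards)

for (outcome, process), adv in zip(group, advantages):
    print(f"outcome={outcome:.0f} process={process:.1f} advantage={adv:+.2f}")
```

With outcome-only rewards, the first two trajectories would receive identical advantages because both end in a correct answer. Once the process score is included, the trajectory with the better-rated routing decisions is ranked higher, which is the extra training signal the paragraph above describes.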
For the AI infrastructure industry, this research has meaningful implications for model efficiency and cost reduction. Large Reasoning Models are computationally expensive to run, so more efficient routing translates directly into lower inference costs and faster response times. The cross-family testing results, in which a router trained on one model family generalizes to another, suggest practical applicability across different architectures.
The methodology also bears on broader questions about reinforcement learning in multi-step reasoning systems, not only routing. Future work will likely involve scaling these insights to larger model ensembles and exploring whether process rewards can improve aspects of language model behavior other than routing decisions.
- RoRo replaces outcome-only rewards with process-guided rewards to better optimize intermediate routing decisions in Large Reasoning Models
- The framework uses context-specific evaluation rubrics generated by a Rubricor component to assess routing trajectory quality
- Testing across five reasoning benchmarks shows consistent improvements in both accuracy and efficiency-cost tradeoffs compared to baseline methods
- Process-based training improves generalization across different model families, indicating practical applicability to diverse architectures
- This approach could significantly reduce inference costs for multi-model reasoning systems by optimizing which tasks route to which models