Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning
Researchers propose Chunk-Level Guided Generation, a training-free method using off-the-shelf large language models to score intermediate reasoning steps during small-model inference for mathematical problem-solving. The approach matches or outperforms specialized reward model-based systems on benchmarks like MATH and GSM8K without requiring expensive step-level training data.
This research addresses a fundamental limitation in leveraging smaller language models for reasoning tasks. Traditional approaches either use majority voting on final answers or employ Process Reward Models (PRMs) that require expensive training with step-level annotations. The proposed Chunk-Level Guided Generation sidesteps training requirements by repurposing existing large models as scorers, making the technique immediately applicable to practitioners without specialized infrastructure.
The key innovation lies in scoring fixed-length reasoning chunks rather than variable-length steps. The authors identify and solve a critical technical problem: length bias in log-probability scoring persists even after normalization when step lengths vary. By constraining chunks to fixed lengths, they eliminate this confound and enable reliable scoring via simple likelihood comparisons. The Contrastive-Guided Selection variant further improves results by identifying where larger and smaller models disagree, surfacing genuinely higher-quality continuations.
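The selection step described above can be illustrated with a small Python sketch. This summary does not give the paper's exact scoring formulas, so everything here is an assumption made for illustration: the summed log-probability score, the large-minus-small contrastive score, the chunk length of 4 tokens, and all names and numbers. The per-token log-probabilities are hard-coded stand-ins for values that would come from the two models.

```python
from dataclasses import dataclass
from typing import Callable, List

CHUNK_LEN = 4  # hypothetical fixed chunk length, in tokens


@dataclass
class Candidate:
    """One candidate continuation of exactly CHUNK_LEN tokens."""
    text: str
    large_logprobs: List[float]  # per-token log-probs under the large scorer (made-up values)
    small_logprobs: List[float]  # per-token log-probs under the small generator (made-up values)


def likelihood_score(c: Candidate) -> float:
    # Assumed scoring rule: total log-likelihood under the large model.
    # Because every chunk has the same length, sums are directly comparable.
    return sum(c.large_logprobs)


def contrastive_score(c: Candidate) -> float:
    # Assumed contrastive rule: how much more the large model favors the
    # chunk than the small model does.
    return sum(c.large_logprobs) - sum(c.small_logprobs)


def select(candidates: List[Candidate], score_fn: Callable[[Candidate], float]) -> Candidate:
    # Enforce the fixed-length constraint before comparing scores.
    for c in candidates:
        if len(c.large_logprobs) != CHUNK_LEN or len(c.small_logprobs) != CHUNK_LEN:
            raise ValueError(f"chunk {c.text!r} is not {CHUNK_LEN} tokens long")
    return max(candidates, key=score_fn)


CANDIDATES = [
    Candidate("chunk A", [-0.5] * 4, [-0.25] * 4),   # likely under both models
    Candidate("chunk B", [-0.75] * 4, [-1.5] * 4),   # large model favors it far more than the small one
    Candidate("chunk C", [-1.0] * 4, [-1.0] * 4),    # both models agree, lower likelihood
]

best_by_likelihood = select(CANDIDATES, likelihood_score)
best_by_contrast = select(CANDIDATES, contrastive_score)
print(best_by_likelihood.text, best_by_contrast.text)
```

With these made-up numbers, plain likelihood scoring picks chunk A, while the contrastive score picks chunk B, the candidate on which the two models disagree most in the large model's favor. In practice the log-probabilities would be obtained from the scoring and generating models rather than written by hand.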
From an industry perspective, this work democratizes guided reasoning inference. Previously, achieving strong mathematical reasoning with guidance required training a dedicated reward model, which is costly and depends on step-level annotations. Now teams can deploy guidance with only API access to a capable LLM, significantly lowering the barrier to implementation. The reported gains over majority voting, as large as 28 percentage points on some benchmarks, are achieved at matched computational budgets.
The practical implications extend beyond mathematics. Any domain requiring step-by-step reasoning, such as code generation, logical inference, or complex planning, could benefit from this framework. The method also produces shorter reasoning traces than PRM-guided search, which reduces computational cost and improves interpretability. Future work will likely explore optimal chunk lengths for different domains and integration with emerging efficient inference techniques.
- Training-free guidance using off-the-shelf LLMs matches or outperforms specialized reward models on mathematical reasoning benchmarks, without step-level annotation costs.
- Fixed-length chunks eliminate length bias in likelihood-based scoring, a systematic problem that persists even after normalization when step lengths vary.
- Contrastive-Guided Selection improves performance by prioritizing reasoning chunks where larger and smaller models diverge.
- The method produces shorter reasoning traces than PRM-guided search, lowering computational costs while improving interpretability.
- Results across five mathematical benchmarks show improvements of 4 to 28 percentage points over majority voting at matched computational budgets.