Sketch-and-Verify: Structured Inference-Time Scaling via Program Sketching
Sketch-and-Verify is an inference-time scaling technique that improves small language model performance by having the LLM generate multiple algorithmic strategies as program sketches, then filling each sketch with candidate completions and verifying the results. On HumanEval+, this approach delivers superior cost-performance within a model tier compared to flat sampling, though upgrading to a stronger model tier remains more effective than scaling test-time compute on smaller models.
Sketch-and-Verify addresses a practical reality facing many practitioners: latency, deployment, or budget constraints force the use of smaller, cheaper models, yet every available accuracy gain still matters. Rather than generating redundant candidate solutions through flat sampling, the method structures the search space by explicitly factorizing it across algorithmic diversity (K sketches) and sample count (M fills per sketch), guaranteeing that each additional sketch explores a fundamentally different approach rather than duplicating existing solutions.
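To make the factorization concrete, here is a minimal Python sketch of the two-stage loop. The helpers passed in (`generate_sketches`, `fill_sketch`, `passes_tests`) are hypothetical stand-ins for the model calls and sandboxed test execution, not the paper's actual interface.

```python
from collections.abc import Callable

def sketch_and_verify(
    problem: str,
    generate_sketches: Callable[[str, int], list[str]],  # LLM call: K distinct strategy sketches
    fill_sketch: Callable[[str, str], str],              # LLM call: complete one sketch into a program
    passes_tests: Callable[[str], bool],                 # sandboxed run against the problem's tests
    k: int = 10,
    m: int = 10,
) -> str | None:
    """Return the first candidate that passes verification, or None."""
    # Stage 1: enumerate K distinct algorithmic strategies, each expressed
    # as a partial program, so every sketch is a structurally different approach.
    sketches = generate_sketches(problem, k)

    # Stage 2: spend the K*M fill budget across strategies (K) and
    # samples per strategy (M), verifying each fill by execution.
    for sketch in sketches:
        for _ in range(m):
            candidate = fill_sketch(problem, sketch)
            if passes_tests(candidate):
                return candidate
    return None
```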
The research demonstrates compelling Pareto improvements on HumanEval+, a challenging programming benchmark. On the 19 problems where Gemini 3.1 Flash Lite fails under greedy decoding, the technique recovers 79% with K=10, M=10, compared to 53% for flat sampling at 3x the budget. This within-tier advantage reflects a principled approach to test-time scaling, in contrast to brute-force increases in sample count.
However, the findings establish clear economic boundaries. Greedy decoding with a Pro-tier model dominates Sketch-and-Verify on the Lite model in both accuracy and cost, establishing that model capability remains the primary lever. The research frames this as an optimization hierarchy: upgrade the model when possible, then apply sketching as the next most cost-effective tier. The method's cleanest contribution lies in structuring search-space exploration, not in any fundamental breakthrough in small-model capability.
For practitioners, this offers a principled middle ground: applicable when tier upgrades are unavailable or unaffordable, with measured improvements that follow predictable K-M trade-offs. The work validates program sketching as a genuine algorithmic-diversity mechanism rather than a mere variant of parametric scaling.
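As a rough illustration of the K-M trade-off, the snippet below enumerates the (K, M) splits available under a fixed fill budget. The budget model (total fills = K x M, ignoring the cost of generating the sketches themselves) is a simplifying assumption for illustration, not the paper's exact cost accounting.

```python
def km_configurations(fill_budget: int) -> list[tuple[int, int]]:
    """Enumerate (K, M) pairs with K * M equal to the fill budget."""
    return [
        (k, fill_budget // k)
        for k in range(1, fill_budget + 1)
        if fill_budget % k == 0
    ]

# With 100 fills, the options run from flat sampling of one strategy
# (K=1, M=100) to pure diversity (K=100, M=1); the reported K=10, M=10
# configuration sits at the midpoint of this spectrum.
print(km_configurations(100))
# [(1, 100), (2, 50), (4, 25), (5, 20), (10, 10), (20, 5), (25, 4), (50, 2), (100, 1)]
```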
- Sketch-and-Verify recovers 79% accuracy on hard HumanEval+ problems vs 53% for flat sampling at 3x budget, demonstrating that structured diversity beats redundant sampling
- Upgrading to a stronger model tier delivers better cost-performance than scaling test-time compute on smaller models, establishing a clear optimization hierarchy
- The method guarantees algorithmic diversity by having the LLM enumerate distinct strategies as partial programs (sketches), preventing duplicate solution exploration
- The K-vs-M trade-off characterization shows predictable scaling behavior, enabling practitioners to tune compute allocation between strategy count and samples per strategy
- The technique composes with execution-based selection and semantic voting, suggesting compatibility with broader verification-based inference approaches; a minimal composition is sketched below
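For the composition with execution-based selection and semantic voting, here is a minimal Python sketch. It assumes each candidate program defines a `solve()` entry point and that `exec` runs in a trusted sandbox; both are simplifying assumptions for illustration, not the paper's implementation.

```python
from collections import Counter

def semantic_vote(candidates: list[str], probe_inputs: list[tuple]) -> str | None:
    """Group candidates by their I/O behavior on probe inputs and return
    a representative of the largest behavioral cluster (semantic voting)."""
    signatures: dict[str, tuple[str, ...]] = {}
    for code in candidates:
        namespace: dict = {}
        try:
            exec(code, namespace)     # assumes a trusted sandbox
            fn = namespace["solve"]   # assumed per-candidate entry point
            signatures[code] = tuple(repr(fn(*args)) for args in probe_inputs)
        except Exception:
            continue                  # drop candidates that crash on any probe
    if not signatures:
        return None
    majority, _ = Counter(signatures.values()).most_common(1)[0]
    return next(c for c, sig in signatures.items() if sig == majority)
```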