More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)
Researchers propose Reset-and-Discard (ReD), a querying method that improves large language model inference efficiency by optimizing coverage@cost, the number of unique questions answered within a fixed budget. The technique reduces the attempts, tokens, and dollars needed to reach a target coverage level on coding, math, and reasoning tasks.
The research addresses a fundamental inefficiency in how large language models are evaluated and deployed. pass@k measures the probability that at least one of k attempts at a question is correct, but it does not account for real-world budget constraints, where compute and token usage translate directly into operating expenses. ReD addresses this gap by shifting the focus to coverage@cost. The research shows that the empirically observed power-law behavior in LLM performance implies diminishing returns: each additional attempt yields a progressively smaller improvement.
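To make the pass@k metric and the diminishing-returns argument concrete, here is a minimal Python sketch. `pass_at_k` uses the standard unbiased estimator from n samples with c correct. The power-law curve in `power_law_pass_at_k`, including its `scale` and `alpha` values, is an assumed toy form chosen for illustration and is not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: n samples drawn, c of them correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def power_law_pass_at_k(k: int, scale: float = 0.8, alpha: float = 0.5) -> float:
    """ASSUMED toy form (not from the paper): failure rate = scale * k**(-alpha)."""
    return 1.0 - scale * k ** (-alpha)


def doubling_gains(max_doublings: int, scale: float = 0.8, alpha: float = 0.5) -> list[float]:
    """Improvement in the toy pass@k each time the number of attempts doubles."""
    ks = [2 ** i for i in range(max_doublings + 1)]
    values = [power_law_pass_at_k(k, scale, alpha) for k in ks]
    return [later - earlier for earlier, later in zip(values, values[1:])]


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
    print(doubling_gains(4))
```

Under this assumed curve, every doubling of the attempt count buys a smaller gain than the previous one, which is the diminishing-returns pattern described above.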
The methodology connects two previously separate evaluation frameworks, pass@k and coverage@cost, and provides a quantitative prediction model for cost savings. By strategically resetting and discarding queries, ReD achieves measurable efficiency gains across diverse benchmarks including HumanEval, GSM8K, and MMLU-Pro, spanning coding, mathematics, and reasoning domains. The approach remains effective even with imperfect verifiers, suggesting practical applicability in real deployment scenarios.
For the AI infrastructure and services industry, ReD has immediate implications for cost optimization. Organizations operating large-scale LLM inference face mounting expenses from token consumption and API calls. This research provides a concrete methodology to reduce those costs without sacrificing output quality, which is particularly valuable for production environments handling high query volumes. The technique can also be used to infer a model's power-law exponent when pass@k data is unavailable.
The significance extends to model evaluation methodology itself. As LLMs proliferate across enterprise applications, efficiency metrics become as critical as raw performance metrics. ReD offers developers and researchers a framework for optimizing inference within realistic budget constraints, potentially influencing how future LLM benchmarking standards are established and how models are selected for resource-constrained deployments.
- ReD reduces computational attempts, tokens, and USD costs required to achieve target coverage levels across multiple LLM benchmarks
- The method quantitatively predicts cost savings and can infer power-law exponents when pass@k data is unavailable
- Coverage@cost provides a more realistic evaluation metric than pass@k for budget-constrained deployment scenarios
- The technique maintains efficiency gains with imperfect verifiers and outperforms existing allocation baselines
- Findings apply across diverse domains including coding, mathematics, and multi-task reasoning benchmarks