Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks
A new study reveals that standard single-run accuracy metrics for large language models significantly overstate their real-world reliability on programming tasks, with gaps reaching 17.8 percentage points when measuring consistency across repeated invocations. The research introduces a repeated-run evaluation protocol showing that while popular benchmarks emphasize one-time success rates, deployment environments require stable outputs—a critical distinction that current evaluation standards overlook.
The disconnect between theoretical performance and practical reliability represents a fundamental gap in how the AI community measures LLM capabilities. This research exposes a critical blind spot: models appearing nearly equivalent on standard benchmarks can exhibit dramatically different stability profiles, causing rankings to shift substantially when consistency matters. The study's 16,000 evaluation instances across 16 models demonstrate that the accuracy-stability relationship is neither uniform nor predictable, particularly for mid-tier performers where the gap proves largest.
This finding emerges as LLMs increasingly move from research contexts into production systems where deterministic behavior matters. Users relying on code generation, API integrations, or mission-critical applications need reproducible outputs, not probabilistic successes. Current benchmarking practices—emphasizing either single-run accuracy or eventual success through repeated sampling—miss this operational reality entirely. The research also identifies that prompt engineering effects are highly model-dependent, suggesting no universal optimization strategy exists across provider families.
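The three ways of scoring repeated runs contrasted above can be sketched in a few lines. This is a minimal illustration, not the study's actual protocol: the function name `summarize_runs`, the metric names, the toy data, and the exact definitions (run-level pass rate as the fraction of all runs that pass, perfect stability as the fraction of tasks that pass on every run) are assumptions made for the example.

```python
from typing import Dict, List


def summarize_runs(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Summarize pass/fail outcomes of repeated runs per task.

    `results` maps a task id to the outcomes of its repeated runs
    (True = the generated program passed). All metric definitions
    here are assumptions for illustration.
    """
    if not results or any(len(runs) == 0 for runs in results.values()):
        raise ValueError("every task needs at least one run")

    n_tasks = len(results)
    total_runs = sum(len(runs) for runs in results.values())

    return {
        # What a one-shot benchmark would report: first attempt only.
        "single_run_accuracy": sum(runs[0] for runs in results.values()) / n_tasks,
        # Fraction of all individual runs that passed.
        "run_level_pass_rate": sum(sum(runs) for runs in results.values()) / total_runs,
        # "Eventual success": at least one run passed for the task.
        "any_run_success": sum(any(runs) for runs in results.values()) / n_tasks,
        # Retry-free reliability: every run passed for the task.
        "perfect_stability": sum(all(runs) for runs in results.values()) / n_tasks,
    }


# Hypothetical toy data: 4 tasks, 4 repeated runs each.
toy = {
    "task_a": [True, True, True, True],
    "task_b": [True, False, True, True],
    "task_c": [False, True, False, False],
    "task_d": [True, True, True, True],
}

summary = summarize_runs(toy)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

In this toy data the first-attempt accuracy is 0.75 and eventual success is 1.00, yet only half of the tasks pass on every run, which is the kind of gap the article describes.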
For developers and organizations deploying LLMs, this work signals the need for stability testing before production rollouts. At the largest reported gap of 17.8 percentage points, a model reporting 85% single-run accuracy would deliver consistent results on only about 67% of tasks (85 - 17.8 = 67.2), a difference that directly affects deployment viability. The strong correlation between run-level pass rate and perfect stability (r=0.985) suggests that run-level pass rate can serve as a practical first signal of stability, though practitioners still need to invest in repeated-run testing rather than relying on published benchmark numbers alone.
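The arithmetic behind the 85% example, and the kind of correlation the r=0.985 figure refers to, can be reproduced with a short sketch. The helper names `stable_coverage` and `pearson_r` are hypothetical, and the per-model numbers below are invented purely to exercise the code; they are not the study's data.

```python
import math
from typing import Sequence


def stable_coverage(single_run_accuracy_pct: float, gap_pct: float) -> float:
    """Consistent-performance estimate after subtracting a stability gap."""
    return single_run_accuracy_pct - gap_pct


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# The article's worked example: 85% accuracy with the largest reported gap.
print(f"{stable_coverage(85.0, 17.8):.1f}%")

# Invented per-model values (NOT from the study), one entry per model.
run_level_pass_rate = [0.62, 0.71, 0.80, 0.88, 0.93]
perfect_stability = [0.41, 0.55, 0.66, 0.79, 0.88]
print(f"r = {pearson_r(run_level_pass_rate, perfect_stability):.3f}")
```

A high r across models only says the two metrics move together on average; it does not remove the need to measure stability for the specific model and prompt being deployed.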
Future LLM evaluation standards should incorporate stability metrics alongside accuracy measurements, establishing consistency thresholds for different deployment contexts rather than treating all applications equivalently.
- Single-run accuracy metrics overstate LLM reliability by up to 17.8 percentage points compared to actual retry-free coverage rates
- Mid-performing models show the largest discrepancies between theoretical and practical stability, potentially reversing competitive rankings
- Stability across repeated invocations is critical for production deployments but remains largely unmeasured in standard benchmarks
- Prompt engineering effects vary significantly across models, indicating no universal optimization strategy improves performance uniformly
- Current LLM evaluation protocols are insufficient for deterministic deployment scenarios requiring consistent outputs