Semantic Voting: Execution-Grounded Consensus for LLM Code Generation
Researchers demonstrate that execution-based voting methods for LLM code generation outperform text-based majority voting by 18-52 percentage points. The study reveals that input quality, particularly sketch-based test generation, matters far more than the aggregation algorithm itself, challenging assumptions about how to select optimal code outputs.
This research addresses a fundamental problem in AI-assisted development: when large language models generate multiple code candidates, how do you select the best one without access to ground truth? The paper's core contribution is reframing this as a signal-quality problem rather than an algorithmic one, with profound implications for AI reliability and deployment.
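To make the baseline concrete: text-based majority voting counts exact textual duplicates among candidate programs. This minimal sketch (names are illustrative, not from the paper) also shows why the baseline is brittle: two semantically identical programs that differ only in a variable name never count as agreeing.

```python
from collections import Counter

def text_majority_vote(candidates: list[str]) -> str:
    """Return the most frequent candidate string verbatim.

    Only exact textual duplicates count as agreement, so two
    semantically equivalent programs with different formatting
    or identifiers are treated as distinct votes.
    """
    return Counter(candidates).most_common(1)[0][0]
```

Note that `"def f(y): return y*2"` would split the vote against `"def f(x): return x*2"` even though both compute the same function, which is exactly the weakness execution-based selection avoids.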
The findings reshape understanding of LLM code selection. Execution-based methods that test candidates on diverse inputs dramatically outperform simple textual voting, yet the choice between SemanticVote, weighted voting, and MBR-Exec barely matters statistically. This suggests developers should focus engineering effort on generating high-quality test inputs rather than optimizing selection algorithms. Sketch-based input generation—structured, LLM-assisted test creation—consistently outperforms both pure LLM generation and random fuzzing, delivering measurable gains of 0.6-2.1 percentage points and up to 11.3 points versus fuzzing.
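By contrast, an execution-grounded selector in the spirit of SemanticVote or MBR-Exec groups candidates by observable behavior rather than by text. The following is a minimal sketch of the general idea, not the paper's implementation; the `run(candidate, x)` helper, which executes a candidate on input `x`, is an assumption for illustration:

```python
from collections import defaultdict

def execution_grounded_select(candidates, test_inputs, run):
    """Cluster candidates by their outputs on shared test inputs,
    then return a representative of the largest semantic cluster.

    Candidates that disagree textually but produce identical outputs
    land in the same cluster, so the vote is over behavior, not text.
    """
    clusters = defaultdict(list)
    for cand in candidates:
        # The tuple of outputs across all test inputs is the candidate's
        # observable "signature" on this input set.
        signature = tuple(run(cand, x) for x in test_inputs)
        clusters[signature].append(cand)
    # Largest cluster wins; any member is behaviorally equivalent
    # to the others on the tested inputs.
    largest = max(clusters.values(), key=len)
    return largest[0]
```

This framing makes the paper's headline finding intuitive: once candidates are clustered by behavior on diverse inputs, the specific tie-breaking rule (plain plurality, weighted votes, or MBR-style risk minimization) changes little, while the quality of `test_inputs` determines whether distinct behaviors separate at all.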
The interaction between model reasoning depth and selection method reveals a critical tradeoff: deeper thinking (chain-of-thought reasoning) improves majority voting by 12 percentage points, yet execution-based methods stagnate or degrade under it. This paradox stems from reduced candidate diversity: when models reason more deeply, their outputs converge, leaving execution-based selectors less behavioral variation to discriminate between.
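One way to see the diversity effect is to count distinct execution signatures among the candidates. This metric and its name are my own illustration, not the paper's:

```python
def semantic_diversity(candidates, test_inputs, run) -> int:
    """Count distinct output signatures across candidates.

    A value of 1 means every candidate behaves identically on the
    test inputs, so an execution-based selector has nothing left to
    discriminate. Text-based voting, by contrast, benefits when
    deeper reasoning makes candidates converge on one answer.
    """
    return len({tuple(run(c, x) for x in test_inputs) for c in candidates})
```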
For AI development teams, the implications are clear: invest in robust test input generation rather than complex voting schemes. The research validates execution-based verification as a practical oracle substitute when formal validation is unavailable. This work accelerates the path toward production-grade AI-assisted coding by identifying which optimization targets yield meaningful improvements.
- Execution-based code selection outperforms text-based voting by 18-52 percentage points across all tested configurations.
- Input quality is the dominant factor—sketch-based test generation outperforms direct LLM generation by 0.6-2.1 points.
- Aggregation algorithm choice has negligible impact once candidates are executed on diverse inputs.
- Deeper model reasoning improves majority voting but degrades execution-based methods due to reduced candidate diversity.
- Treating code selection as a signal-quality problem rather than an aggregation problem yields better practical results.