When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
Researchers introduce Propagational Proxy Voting (PPV), an unsupervised aggregation method for multi-sample LLM inference that outperforms standard majority voting on the MMLU-Pro benchmark by using semantic entropy and reasoning geometry signals. The method achieves a +1.5 percentage point (pp) improvement overall and +2.24 pp on difficult questions, without requiring labeled data or auxiliary training.
This research addresses an inefficiency in how multiple sampled outputs from a large language model are aggregated. Traditional majority voting treats each sample as a single, equally weighted vote, discarding information about model confidence and reasoning consistency. PPV uses two signals that majority voting ignores: within-sample semantic entropy (how confidently a model expresses its answer) and between-sample geometric coherence (whether reasoning paths align in embedding space).
The method partitions 128 samples into 16 groups and computes semantic entropy and embedding centroids to construct a stochastic delegation matrix that dynamically weights each voter's influence based on these signals. The improvements on MMLU-Pro are statistically significant (p ~ 1.0e-14), which supports the robustness of the approach.
In one illustrative case, PPV overturns a 10-6 majority vote: the minority cluster is geometrically coherent (+0.26 cosine similarity) while the majority cluster is incoherent (-0.02), indicating that the minority's reasoning is more internally consistent.
The finding has implications for production LLM systems, where inference-time sampling is computationally expensive: better aggregation methods improve cost-efficiency directly by extracting more signal from each forward pass. The negative results, which show that confidence-based ensemble methods cannot close the gap to oracle performance, help establish principled boundaries for unsupervised aggregation research. The work is particularly relevant as practitioners increasingly rely on multi-sample inference to improve LLM reliability without fine-tuning, making the choice of aggregation method a competitive advantage.
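
The delegation step described above can be sketched in a few lines of Python. This summary does not give PPV's actual formulas, so everything specific here is an assumption made for illustration: the function names (`delegation_matrix`, `propagate_votes`, `aggregate`), the `exp(-entropy / temperature)` self-retention rule, delegation in proportion to positive cosine similarity between centroids, and the fixed number of propagation steps. The sketch only shows the general shape of the idea: uncertain voters hand part of their vote to voters whose reasoning lies nearby in embedding space.

```python
import numpy as np


def delegation_matrix(entropies, centroids, temperature=1.0):
    """Build a row-stochastic matrix D (illustrative, not the paper's formula).

    D[i, j] is the share of voter i's vote passed to voter j.
    Assumes every centroid is a non-zero vector.
    """
    entropies = np.asarray(entropies, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    n = len(entropies)

    # Assumption: confident (low-entropy) voters keep more of their own vote.
    keep = np.exp(-entropies / temperature)

    # Cosine similarity between centroids; self-similarity is excluded.
    unit = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, 0.0)
    # Assumption: delegate only to voters with positive similarity.
    sim = np.clip(sim, 0.0, None)

    row_sums = sim.sum(axis=1, keepdims=True)
    # A voter with no aligned peer keeps its entire vote.
    keep = np.where(row_sums[:, 0] > 0, keep, 1.0)
    share = np.divide(sim, row_sums, out=np.zeros_like(sim), where=row_sums > 0)

    D = (1.0 - keep)[:, None] * share
    D[np.arange(n), np.arange(n)] = keep
    return D


def propagate_votes(D, steps=50):
    """Start from one equal vote per voter and push mass along delegations."""
    n = D.shape[0]
    weights = np.full(n, 1.0 / n)
    for _ in range(steps):
        weights = weights @ D
    return weights


def aggregate(answers, weights):
    """Return the answer with the largest total weight, plus all totals."""
    totals = {}
    for answer, weight in zip(answers, weights):
        totals[answer] = totals.get(answer, 0.0) + float(weight)
    return max(totals, key=totals.get), totals
```

With one voter per group, `answers` would hold the 16 group answers, `entropies` the 16 group entropies, and `centroids` the 16 group embedding centroids. When every entropy is zero the matrix reduces to the identity and the result is ordinary majority voting, which makes the sketch easy to sanity-check against the baseline.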
- PPV improves on majority voting on MMLU-Pro by +1.5 pp overall and +2.24 pp on difficult questions by incorporating semantic entropy and reasoning geometry signals
- The method requires no labeled data, auxiliary training, or external models, making it practical for deployment in existing inference pipelines
- Geometric incoherence of a majority cluster can indicate an incorrect consensus, revealing cases where the minority's reasoning is more internally consistent
- The research identifies a fundamental limit: confidence-based modes cannot fully close the gap to oracle performance in unsupervised aggregation
- The approach extracts additional signal from existing multi-sample inference computations, improving cost-efficiency without additional forward passes
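
The coherence figures quoted above (+0.26 for the minority cluster, -0.02 for the majority) are cosine similarities. One plausible way to compute such a score, shown here only as a sketch, is the mean pairwise cosine similarity among the embeddings of all samples that chose the same answer. The function name `cluster_coherence` and the choice of a mean over pairs are assumptions; the summary does not say exactly how the paper measures coherence.

```python
import numpy as np


def cluster_coherence(embeddings):
    """Mean pairwise cosine similarity within one answer cluster.

    Illustrative definition only. Assumes non-zero embedding vectors.
    Returns 0.0 for a cluster with fewer than two members.
    """
    X = np.asarray(embeddings, dtype=float)
    n = len(X)
    if n < 2:
        return 0.0
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Average over ordered pairs (i, j) with i != j, so drop the diagonal.
    off_diagonal_sum = sim.sum() - np.trace(sim)
    return float(off_diagonal_sum / (n * (n - 1)))
```

Under this definition a score near zero means the reasoning embeddings point in unrelated directions, while a clearly positive score means they agree, which matches the intuition behind trusting a coherent minority over an incoherent majority.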