Efficient Ensemble Selection from Binary and Pairwise Feedback
Researchers present new algorithms for efficiently selecting small, high-performing ensembles of AI systems using minimal model evaluations. The work addresses both binary feedback (correct/incorrect outcomes) and pairwise feedback (preference comparisons), providing theoretical guarantees and practical query-saving methods validated through LLM experiments.
This research tackles a practical challenge facing organizations that deploy multiple AI systems: selecting optimal subsets without exhaustive evaluation costs. The problem is formulated as a distributional voting variant in which tasks are sampled from unknown domains and an ensemble's value depends on its best-performing member.

For binary feedback, the authors establish matching upper and lower bounds on query complexity and introduce a failure-conditioned greedy algorithm that retains the classical (1-1/e) approximation guarantee while achieving instance-dependent query savings. The pairwise setting proves harder: full-information optimization admits a PTAS (polynomial-time approximation scheme) but no EPTAS under standard complexity assumptions, indicating fundamental limits. Critically, the pairwise objective is monotone but not submodular, and submodularity is precisely the property that would otherwise license greedy approximations. To overcome this, the authors propose a weighted ordinal coverage relaxation that restores submodularity and enables practical oracle-based selection. Finite-family auditing and minimax wrapper techniques then convert relaxation solutions into theta-type performance guarantees on the original objective.

Small-scale LLM experiments validate the predicted query savings and show that ensemble complementarity, i.e., how well different models offset each other's weaknesses, significantly affects selection efficiency. The work advances the foundations of multi-agent AI deployment, reducing evaluation overhead for organizations balancing performance against cost.
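To make the binary-feedback setting concrete, here is a minimal sketch of greedy max-coverage selection with a failure-conditioned twist: once the current ensemble answers a task correctly, no further models are queried on that task. The function names, the `correct[model][task]` data layout, and the query-counting convention are illustrative assumptions, not the paper's actual interface or algorithm.

```python
def greedy_ensemble(correct, k):
    """Greedy max-coverage selection of up to k models.

    correct[m][t] = 1 if model m answers task t correctly, else 0.
    Failure-conditioned sketch: each round, candidate models are
    evaluated only on tasks the current ensemble still fails, so
    query cost shrinks as coverage grows.
    """
    n_models = len(correct)
    n_tasks = len(correct[0])
    chosen = []
    uncovered = set(range(n_tasks))  # tasks the ensemble still fails
    queries = 0
    for _ in range(min(k, n_models)):
        best_m, best_gain = None, 0
        for m in range(n_models):
            if m in chosen:
                continue
            # Query model m only on still-failed tasks.
            gain = sum(correct[m][t] for t in uncovered)
            queries += len(uncovered)
            if gain > best_gain:
                best_m, best_gain = m, gain
        if best_m is None:  # no remaining model adds coverage
            break
        chosen.append(best_m)
        uncovered -= {t for t in range(n_tasks) if correct[best_m][t]}
    return chosen, queries
```

On complementary models the uncovered set shrinks quickly, so later rounds cost far fewer queries than evaluating every candidate on every task, which is the instance-dependent saving the summary describes.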
- Failure-conditioned greedy algorithms achieve the standard (1-1/e) approximation guarantee while reducing query complexity on average-case instances
- Pairwise-feedback optimization admits a PTAS but not an EPTAS, establishing theoretical hardness under Gap-ETH
- A weighted ordinal coverage relaxation enables submodular optimization when the direct objective lacks this property
- Ensemble complementarity significantly influences selection efficiency and committee performance
- LLM experiments confirm theoretical predictions about query savings in practical AI system selection
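The weighted ordinal coverage idea can be sketched as a coverage function over (task, reference) pairs: a pair is "covered" when some ensemble member is preferred to that reference, and pairs carry importance weights. This surrogate is submodular by construction, so standard greedy applies. Everything below, including the `beats` and `weights` encodings, is a hedged illustration of the general relaxation pattern, not the paper's actual formulation or wrapper.

```python
def ordinal_coverage(S, beats, weights):
    """Weighted coverage surrogate for pairwise feedback.

    beats[m][t][r] = True if model m is preferred to reference
                     response r on task t.
    weights[t][r]  = importance weight of covering pair (t, r).
    Returns the total weight of pairs covered by ensemble S.
    """
    total = 0.0
    for t, ws in enumerate(weights):
        for r, w in enumerate(ws):
            if any(beats[m][t][r] for m in S):
                total += w
    return total

def greedy_relaxed(beats, weights, k):
    """Greedy on the submodular surrogate.

    Greedy gives a (1 - 1/e) guarantee for the surrogate; the
    summary's auditing/wrapper step, which converts this into a
    guarantee on the true pairwise objective, is not reproduced here.
    """
    n_models = len(beats)
    S = []
    for _ in range(min(k, n_models)):
        gains = [(ordinal_coverage(S + [m], beats, weights), m)
                 for m in range(n_models) if m not in S]
        if not gains:
            break
        _, m = max(gains)
        S.append(m)
    return S
```

The design choice here mirrors the summary: rather than optimizing the non-submodular pairwise objective directly, greedy runs on a well-behaved coverage relaxation, and a separate step transfers its guarantee back.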