Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
arXiv – CS AI | Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, David Blei
AI Summary
Researchers introduce Duel-Evolve, an optimization algorithm that improves LLM performance at test time without external rewards or labels. The method optimizes over self-generated pairwise comparisons and achieved accuracy gains of 20 percentage points on MathBench and over 12 percentage points on LiveCodeBench.
Key Takeaways
- Duel-Evolve optimizes LLM outputs using pairwise preferences from the same LLM instead of external scalar rewards.
- The algorithm achieved 20 percentage points higher accuracy on MathBench compared to existing methods.
- Performance improved by over 12 percentage points on LiveCodeBench versus comparable iterative methods.
- The method requires no reward model, ground-truth labels, or hand-crafted scoring functions during optimization.
- Uses a Bayesian Bradley-Terry model and Double Thompson Sampling to guide the optimization process.
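The loop described in the takeaways (duel two candidate outputs, record which one the LLM itself prefers, fit a Bayesian Bradley-Terry-style posterior over pairwise win rates, and pick the next duel via Double Thompson Sampling) can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the `judge` function is a toy stand-in for an LLM self-preference call, and the candidate pool, round count, and selection details are all assumptions.

```python
import random

def judge(a, b):
    # Toy stand-in for an LLM self-preference query: returns True if `a` is
    # preferred. Candidates here are numbers; the hidden preference favors
    # larger values, with a Bradley-Terry-shaped win probability.
    return random.random() < a**2 / (a**2 + b**2)

def double_thompson_select(wins, n):
    # Simplified Double Thompson Sampling step: sample a plausible preference
    # matrix from per-pair Beta posteriors, pick the candidate with the best
    # sampled Copeland score, then resample to pick its strongest challenger.
    theta = [[random.betavariate(wins[i][j] + 1, wins[j][i] + 1)
              for j in range(n)] for i in range(n)]
    copeland = [sum(theta[i][j] > 0.5 for j in range(n) if j != i)
                for i in range(n)]
    first = max(range(n), key=lambda i: copeland[i])
    theta2 = [random.betavariate(wins[j][first] + 1, wins[first][j] + 1)
              for j in range(n)]
    second = max((j for j in range(n) if j != first), key=lambda j: theta2[j])
    return first, second

def duel_loop(candidates, rounds=200, seed=0):
    # Reward-free test-time selection: no labels or scalar rewards, only
    # pairwise win counts accumulated from self-preference duels.
    random.seed(seed)
    n = len(candidates)
    wins = [[0] * n for _ in range(n)]
    for _ in range(rounds):
        i, j = double_thompson_select(wins, n)
        if judge(candidates[i], candidates[j]):
            wins[i][j] += 1
        else:
            wins[j][i] += 1
    # Return the candidate with the most total duel wins.
    totals = [sum(row) for row in wins]
    return candidates[max(range(n), key=lambda k: totals[k])]

best = duel_loop([1.0, 2.0, 5.0, 10.0])
print(best)
```

In the full method the candidates would themselves evolve between rounds; this sketch only shows the preference-driven selection half, where the sampler concentrates duels on the strongest candidates without ever computing a scalar reward.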
#llm #optimization #test-time-scaling #self-preference #evolutionary-algorithm #machine-learning #research #performance-improvement
Read Original via arXiv – CS AI