Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences
arXiv – CS AI | Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, David Blei
AI Summary
Researchers introduce Duel-Evolve, an optimization algorithm that improves LLM performance at test time without external rewards or labels. The method optimizes over self-generated pairwise comparisons and achieved accuracy gains of 20 percentage points on MathBench and over 12 percentage points on LiveCodeBench.
Key Takeaways
- Duel-Evolve optimizes LLM outputs using pairwise preferences from the same LLM instead of external scalar rewards.
- The algorithm achieved 20 percentage points higher accuracy on MathBench compared to existing methods.
- Performance improved by over 12 percentage points on LiveCodeBench versus comparable iterative methods.
- The method requires no reward model, ground-truth labels, or hand-crafted scoring functions during optimization.
- Uses a Bayesian Bradley-Terry model and Double Thompson Sampling to guide the optimization process.
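The loop described in the takeaways (duel two candidate outputs, record which one the LLM itself prefers, fit a Bayesian Bradley-Terry-style posterior over pairwise win rates, and pick the next duel via Double Thompson Sampling) can be sketched as below. This is a minimal illustrative sketch, not the authors' implementation: the `judge` function is a toy stand-in for an LLM self-preference call, and the candidate pool, round count, and selection details are all assumptions.

```python
import random

def judge(a, b):
    # Toy stand-in for an LLM self-preference query: returns True if `a` is
    # preferred. Candidates here are numbers; the hidden preference favors
    # larger values, with a Bradley-Terry-shaped win probability.
    return random.random() < a**2 / (a**2 + b**2)

def double_thompson_select(wins, n):
    # Simplified Double Thompson Sampling step: sample a plausible preference
    # matrix from per-pair Beta posteriors, pick the candidate with the best
    # sampled Copeland score, then resample to pick its strongest challenger.
    theta = [[random.betavariate(wins[i][j] + 1, wins[j][i] + 1)
              for j in range(n)] for i in range(n)]
    copeland = [sum(theta[i][j] > 0.5 for j in range(n) if j != i)
                for i in range(n)]
    first = max(range(n), key=lambda i: copeland[i])
    theta2 = [random.betavariate(wins[j][first] + 1, wins[first][j] + 1)
              for j in range(n)]
    second = max((j for j in range(n) if j != first), key=lambda j: theta2[j])
    return first, second

def duel_loop(candidates, rounds=200, seed=0):
    # Reward-free test-time selection: no labels or scalar rewards, only
    # pairwise win counts accumulated from self-preference duels.
    random.seed(seed)
    n = len(candidates)
    wins = [[0] * n for _ in range(n)]
    for _ in range(rounds):
        i, j = double_thompson_select(wins, n)
        if judge(candidates[i], candidates[j]):
            wins[i][j] += 1
        else:
            wins[j][i] += 1
    # Return the candidate with the most total duel wins.
    totals = [sum(row) for row in wins]
    return candidates[max(range(n), key=lambda k: totals[k])]

best = duel_loop([1.0, 2.0, 5.0, 10.0])
print(best)
```

In the full method the candidates would themselves evolve between rounds; this sketch only shows the preference-driven selection half, where the sampler concentrates duels on the strongest candidates without ever computing a scalar reward.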
#llm #optimization #test-time-scaling #self-preference #evolutionary-algorithm #machine-learning #research #performance-improvement
Read Original via arXiv – CS AI