y0news
🧠 AI · 🟢 Bullish · Importance 6/10

Duel-Evolve: Reward-Free Test-Time Scaling via LLM Self-Preferences

arXiv – CS AI | Sweta Karlekar, Carolina Zheng, Magnus Saebo, Nicolas Beltran-Velez, Shuyang Yu, John Bowlan, Michal Kucer, David Blei
🤖AI Summary

Researchers introduce Duel-Evolve, an optimization algorithm that improves LLM performance at test time without external rewards or labels. Guided only by self-generated pairwise comparisons, the method achieved accuracy gains of 20 percentage points on MathBench and over 12 percentage points on LiveCodeBench compared with existing iterative methods.

Key Takeaways
  • Duel-Evolve optimizes LLM outputs using pairwise preferences from the same LLM instead of external scalar rewards.
  • The algorithm achieved 20 percentage points higher accuracy on MathBench compared to existing methods.
  • Performance improved by over 12 percentage points on LiveCodeBench versus comparable iterative methods.
  • The method requires no reward model, ground-truth labels, or hand-crafted scoring functions during optimization.
  • The method uses a Bayesian Bradley-Terry model with Double Thompson Sampling to decide which candidate outputs to compare next.
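The takeaways describe a dueling-bandit loop: maintain a Bradley-Terry-style posterior over pairwise win probabilities and use Double Thompson Sampling to pick which two candidate outputs the LLM should compare next. The sketch below illustrates that selection machinery only; the candidate pool, the Beta-posterior parameterization, and all names here are illustrative assumptions, not the paper's implementation.

```python
import random


class DoubleThompsonSampler:
    """Minimal sketch of Double Thompson Sampling over pairwise
    win/loss counts with Beta posteriors (a simple stand-in for a
    Bayesian Bradley-Terry model). Hypothetical, not Duel-Evolve's code."""

    def __init__(self, n_candidates, seed=0):
        self.n = n_candidates
        # wins[i][j] = number of duels in which candidate i was preferred over j
        self.wins = [[0] * n_candidates for _ in range(n_candidates)]
        self.rng = random.Random(seed)

    def _sample_win_prob(self, i, j):
        # Draw P(i beats j) from a Beta(1 + wins, 1 + losses) posterior.
        return self.rng.betavariate(1 + self.wins[i][j], 1 + self.wins[j][i])

    def select_duel(self):
        # First arm: candidate with the best sampled total win probability.
        scores = [
            sum(self._sample_win_prob(i, j) for j in range(self.n) if j != i)
            for i in range(self.n)
        ]
        first = max(range(self.n), key=lambda i: scores[i])
        # Second arm: resample and pick the strongest challenger to `first`.
        challengers = [j for j in range(self.n) if j != first]
        second = max(challengers, key=lambda j: self._sample_win_prob(j, first))
        return first, second

    def update(self, winner, loser):
        # In Duel-Evolve's setting the preference would come from the
        # same LLM judging its own outputs; here it is just a count update.
        self.wins[winner][loser] += 1
```

In use, each `select_duel` call names two candidate outputs, the LLM states a preference between them, and `update` folds that verdict back into the posterior, so no reward model or ground-truth label is ever consulted.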