y0news
AI · Neutral · arXiv – CS AI · 7h ago · 6/10

When LLM Judge Scores Look Good but Best-of-N Decisions Fail

A study of large language models used as judges finds that global correlation metrics can make a judge look adequate while it fails at the task that matters: picking the best response from a set. Across 5,000 prompts, a judge with moderate global correlation (r = 0.47) captured only 21% of the potential best-of-n improvement, because its within-prompt ranking was poor despite decent overall agreement with true quality.
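The gap is easy to reproduce in simulation (an illustrative sketch, not the paper's data or numbers): if a judge tracks which prompts are easy or hard (between-prompt signal) but only faintly distinguishes candidates within a prompt, pooled correlation looks respectable while best-of-n selection captures little of the available improvement.

```python
# Sketch: global correlation vs. best-of-n capture for a simulated judge.
# All constants here are illustrative assumptions, not the study's setup.
import numpy as np

rng = np.random.default_rng(0)
P, N = 500, 8                       # prompts, candidates per prompt

mu = rng.normal(0, 1.0, (P, 1))     # per-prompt difficulty (shared signal)
eps = rng.normal(0, 0.5, (P, N))    # true within-prompt quality differences
true = mu + eps                     # true response quality

# The judge sees the prompt-level signal clearly, but the within-prompt
# differences only faintly, buried under scoring noise.
judge = mu + 0.3 * eps + rng.normal(0, 1.4, (P, N))

# Global correlation pools all responses across prompts, so the shared
# prompt-difficulty term dominates and r looks moderate.
r = np.corrcoef(true.ravel(), judge.ravel())[0, 1]

# Best-of-n: take the judge's favourite per prompt, then measure the true
# quality gained over a random pick, relative to the oracle (true argmax).
picked = true[np.arange(P), judge.argmax(axis=1)].mean()
oracle = true.max(axis=1).mean()
random_pick = true.mean()
capture = (picked - random_pick) / (oracle - random_pick)

print(f"global r = {r:.2f}, improvement captured = {capture:.0%}")
```

With these settings the pooled correlation comes out near the moderate range while the captured fraction of the best-of-n gain stays far below it, mirroring the mismatch the study describes.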