When LLM Judge Scores Look Good but Best-of-N Decisions Fail
Research shows that LLM judges used to score responses can look deceptively good under global correlation metrics while failing at the best-of-n selection they are actually deployed for. In a study over 5,000 prompts, a judge with moderate global correlation (r = 0.47) captured only 21% of the potential improvement from best-of-n selection, primarily because its within-prompt rankings were poor despite decent overall agreement.
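A minimal synthetic sketch of the failure mode (not the paper's setup; all parameters, the n = 8 candidate count, and the Gaussian score model are illustrative assumptions chosen to land near the reported regime): judge scores that track prompt-level quality but are mostly noise on within-prompt differences yield a moderate global Pearson r, yet capture only a small slice of the oracle's best-of-n gain.

```python
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n = 5000, 8  # 5,000 prompts as in the study; n = 8 candidates is an assumption

# True quality = prompt-level baseline + within-prompt variation.
prompt_fx = rng.normal(0.0, 1.0, (n_prompts, 1))   # shared by all of a prompt's candidates
within = rng.normal(0.0, 0.5, (n_prompts, n))      # the part best-of-n actually has to rank
quality = prompt_fx + within

# Hypothetical judge: tracks the prompt-level baseline reasonably well but is
# mostly noise on within-prompt differences (coefficients are illustrative).
judge = 0.57 * prompt_fx + 0.5 * within + rng.normal(0.0, 1.16, (n_prompts, n))

# Global Pearson correlation over all (prompt, candidate) pairs looks moderate...
r = np.corrcoef(quality.ravel(), judge.ravel())[0, 1]

# ...but best-of-n selection captures little of the oracle's gain over a random pick.
random_pick = quality.mean(axis=1)                                # expected random choice
judge_pick = quality[np.arange(n_prompts), judge.argmax(axis=1)]  # judge's best-of-n pick
oracle_pick = quality.max(axis=1)                                 # true best candidate
captured = (judge_pick - random_pick).mean() / (oracle_pick - random_pick).mean()

print(f"global r = {r:.2f}   improvement captured = {captured:.0%}")
# With these synthetic parameters, expect roughly r ~ 0.47 and ~21% captured.
```

The construction makes the mechanism visible: the judge's global r is driven mostly by the prompt-level term, which is constant within a prompt and therefore contributes nothing to ranking a prompt's candidates against each other.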