AI · Neutral · arXiv – CS AI · 7h ago · 6/10
Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory
Researchers introduce a diagnostic framework that uses Item Response Theory (IRT) to assess the reliability of large language models (LLMs) used as automated judges. The framework evaluates LLM judges along two dimensions: intrinsic consistency (stability of judgments under prompt variations) and human alignment (correspondence with human assessments). It also offers practical guidance for identifying sources of unreliability.
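To make the two dimensions concrete, here is a minimal Python sketch. It is not the authors' method: the summary does not say which IRT model the framework uses, so the two-parameter logistic (2PL) function below is only the textbook form, and `agreement_rate` is a crude raw-agreement proxy rather than an IRT estimate. All names and verdict lists are hypothetical and for illustration only.

```python
import math


def irt_2pl(theta: float, a: float, b: float) -> float:
    """Textbook 2PL item response function (assumed here for illustration).

    theta: latent trait level, a: item discrimination, b: item difficulty.
    Returns the probability of a positive response.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def agreement_rate(verdicts_a, verdicts_b) -> float:
    """Fraction of items on which two verdict lists agree."""
    if len(verdicts_a) != len(verdicts_b):
        raise ValueError("verdict lists must have the same length")
    if not verdicts_a:
        raise ValueError("verdict lists must not be empty")
    matches = sum(x == y for x, y in zip(verdicts_a, verdicts_b))
    return matches / len(verdicts_a)


# Hypothetical binary verdicts (1 = accept, 0 = reject) on six items.
judge_original = [1, 0, 1, 1, 0, 1]     # LLM judge, original prompt
judge_paraphrased = [1, 0, 0, 1, 0, 1]  # same judge, paraphrased prompt
human_labels = [1, 1, 1, 1, 0, 0]       # human assessments

# Proxy for intrinsic consistency: stability under a prompt variation.
consistency = agreement_rate(judge_original, judge_paraphrased)

# Proxy for human alignment: correspondence with human assessments.
alignment = agreement_rate(judge_original, human_labels)
```

Raw agreement treats every item as equally informative. An IRT model instead assigns each item parameters such as `a` and `b` above, which is what would let a diagnostic separate unstable judgments from items that are simply hard.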