y0news

#measurement-validity News & Analysis

1 article tagged with #measurement-validity. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 7h ago · 6/10

Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory

Researchers introduce a diagnostic framework that uses Item Response Theory (IRT) to assess the reliability of Large Language Models used as automated judges. The framework evaluates LLM judges on two dimensions: intrinsic consistency (stability under prompt variations) and human alignment (correspondence with human assessments), and offers practical guidance for identifying sources of unreliability.
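
For readers unfamiliar with IRT, here is a minimal sketch of the standard two-parameter logistic (2PL) item response function. The summary does not say which IRT model the paper uses, so this is only a generic textbook illustration, not the authors' method. The function name `irt_2pl` and all numeric values are hypothetical.

```python
import math


def irt_2pl(theta: float, a: float, b: float) -> float:
    """Probability of a positive response under the standard 2PL IRT model.

    theta: latent trait of the respondent
    a:     item discrimination (how sharply the item separates trait levels)
    b:     item difficulty (trait level at which the probability is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical values, chosen only to show the shape of the curve.
p_equal = irt_2pl(theta=0.0, a=1.5, b=0.0)  # trait equals difficulty
p_above = irt_2pl(theta=1.0, a=1.5, b=0.0)  # trait above difficulty
p_below = irt_2pl(theta=-1.0, a=1.5, b=0.0)  # trait below difficulty
```

In this generic model, the response probability is exactly 0.5 when the latent trait equals the item difficulty and rises monotonically with the trait. How the paper maps judge consistency and human alignment onto such parameters is not stated in the summary.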