#model-assessment News & Analysis

28 articles tagged with #model-assessment. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

28 articles

AINeutralarXiv – CS AI · Feb 276/104

🧠

Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach

Researchers propose using psychometric modeling to correct systematic biases in human evaluations of AI systems, demonstrating how Item Response Theory can separate true AI output quality from rater behavior inconsistencies. The approach was tested on OpenAI's summarization dataset and showed improved reliability in measuring AI model performance.

AINeutralHugging Face Blog · Apr 165/107

🧠

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

LiveCodeBench introduces a new leaderboard for evaluating code-focused Large Language Models (LLMs) with an emphasis on holistic assessment and contamination-free testing. The benchmark aims to provide more accurate and reliable evaluation of AI coding capabilities by addressing common issues in existing evaluation methods.

AINeutralHugging Face Blog · Feb 25/108

🧠

NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

NPHardEval Leaderboard introduces a new evaluation framework for assessing large language models' reasoning capabilities through computational complexity classes with dynamic updates. The leaderboard aims to provide more rigorous testing of LLM reasoning abilities by incorporating problems from different complexity categories.

← PrevPage 2 of 2