y0news

#model-assessment News & Analysis

28 articles tagged with #model-assessment. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Feb 27 · 6/10 · 4

Correcting Human Labels for Rater Effects in AI Evaluation: An Item Response Theory Approach

Researchers propose using psychometric modeling to correct systematic biases in human evaluations of AI systems, showing how Item Response Theory can separate the true quality of AI outputs from inconsistencies in rater behavior. The approach was tested on OpenAI's summarization dataset and improved the reliability of AI model performance measurements.

AI · Neutral · Hugging Face Blog · Feb 2 · 5/10 · 8

NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

The NPHardEval Leaderboard introduces an evaluation framework that assesses the reasoning capabilities of large language models using problems drawn from different computational complexity classes. The leaderboard is updated dynamically and aims to provide more rigorous testing of LLM reasoning.
