#openeval · 1 article
AI · Neutral · arXiv – CS AI · 5h ago · 6/10
Position: Science of AI Evaluation Requires Item-level Benchmark Data

The authors argue that current AI evaluation practice suffers from systemic validity failures, and that releasing item-level benchmark data (per-question results rather than aggregate scores) is essential for rigorous, evidence-centered evaluation. They introduce OpenEval, a repository of item-level benchmark data intended to support such evaluation and enable fine-grained diagnostic analysis of model performance.