AI · Neutral · arXiv – CS AI · Apr 7 · 6/10
Position: Science of AI Evaluation Requires Item-level Benchmark Data
Researchers argue that current AI evaluation methods suffer from systemic validity failures and that item-level benchmark data is essential for rigorous evaluation. They introduce OpenEval, a repository of item-level benchmark data intended to support evidence-centered AI evaluation and enable fine-grained diagnostic analysis.