AINeutral · arXiv - CS AI · 5h ago · 6/10
Position: Science of AI Evaluation Requires Item-level Benchmark Data
Researchers argue that current AI evaluation methods suffer from systemic validity failures and position item-level benchmark data (per-question results, not just aggregate scores) as a prerequisite for a rigorous science of AI evaluation. They introduce OpenEval, a repository of item-level benchmark data intended to support evidence-centered evaluation and enable fine-grained diagnostic analysis.
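A minimal toy sketch (not OpenEval's actual API or data) of why item-level results matter: two models can have identical aggregate accuracy while failing on largely disjoint items, a difference that only per-item data reveals.

```python
# Hypothetical per-item correctness vectors (1 = correct) for two models
# on the same six benchmark items. All names and data are illustrative.
model_a = [1, 1, 1, 0, 0, 1]
model_b = [0, 0, 1, 1, 1, 1]

# Aggregate view: the two models look identical.
acc_a = sum(model_a) / len(model_a)
acc_b = sum(model_b) / len(model_b)

# Item-level view: the items where exactly one model succeeds,
# which supports fine-grained diagnosis of differing failure modes.
disagreements = [i for i, (a, b) in enumerate(zip(model_a, model_b)) if a != b]

print(f"accuracy A = {acc_a:.2f}, accuracy B = {acc_b:.2f}")
print(f"items where the models disagree: {disagreements}")
```

Here both models score 0.67, yet they disagree on four of six items — information lost the moment results are reduced to a single leaderboard number.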