AI · Neutral · arXiv – CS AI · Mar 27 · 6/10
Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients
Researchers introduce a new nonparametric method called signed isotonic R² for efficiently detecting problematic items in AI benchmarks and assessments. The method outperforms traditional diagnostic techniques across major AI datasets including GSM8K and MMLU, offering a lightweight solution for improving evaluation quality.