AINeutralarXiv โ CS AI ยท 8h ago6/10
๐ง
Efficient Detection of Bad Benchmark Items with Novel Scalability Coefficients
Researchers introduce a new nonparametric method called signed isotonic Rยฒ for efficiently detecting problematic items in AI benchmarks and assessments. The method outperforms traditional diagnostic techniques across major AI datasets including GSM8K and MMLU, offering a lightweight solution for improving evaluation quality.