arXiv — CS AI · 9h ago
RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following
Researchers introduce RubricEval, the first rubric-level meta-evaluation benchmark for assessing how well AI judges evaluate instruction following in large language models. Even advanced models such as GPT-4o reach only 55.97% accuracy on the challenging subset, highlighting significant gaps in the reliability of AI-based evaluation.