arXiv – CS AI · Mar 27
RubricEval: A Rubric-Level Meta-Evaluation Benchmark for LLM Judges in Instruction Following
Researchers introduce RubricEval, the first rubric-level meta-evaluation benchmark for assessing how well LLM judges evaluate instruction following in large language models. Even advanced models such as GPT-4o reach only 55.97% accuracy on the benchmark's challenging subset, highlighting significant gaps in the reliability of automated evaluation.