AIBearisharXiv – CS AI · 6h ago6/10
🧠
RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning
Researchers introduce RealMath-Eval, a benchmark revealing that state-of-the-art LLM judges fail to accurately evaluate authentic student mathematical reasoning, performing significantly worse on real exam responses (MSE ~2.96) than on synthetic LLM-generated solutions (MSE ~1.17). The study identifies an "Evaluation Gap" stemming from human errors occupying a more diverse semantic space than the predictable patterns found in synthetic errors.