AI · Neutral · arXiv – CS AI · 14h ago · 6/10
From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges
Researchers introduce Rulers, a three-stage framework that improves how large language models score text against human rubrics. It converts qualitative criteria into locked specifications, applies structured checklists with evidence grounding, and calibrates how the resulting scores are interpreted. The approach targets three key failure modes in LLM-based scoring and shows stronger alignment with human scoring across multiple benchmarks in essay evaluation, summarization, and writing assessment.
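To make the three stages concrete, here is a minimal Python sketch of how such a pipeline could be wired together. It is an illustration under stated assumptions, not the Rulers implementation: the names `ChecklistItem`, `lock_rubric`, `grounded_score`, and `calibrate` are hypothetical, as are the verbatim-quote check and the cut-point mapping. The LLM judge itself is left out, and its output is represented by a plain dictionary.

```python
import hashlib
import json
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistItem:
    # Hypothetical representation of one rubric criterion as a checklist question.
    item_id: str
    question: str
    weight: float


def lock_rubric(items):
    """Stage 1 (assumed): freeze the checklist and fingerprint it,
    so every scoring run can prove it used the same specification."""
    frozen = tuple(items)
    payload = json.dumps([[i.item_id, i.question, i.weight] for i in frozen])
    fingerprint = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return frozen, fingerprint


def grounded_score(text, items, judgments):
    """Stage 2 (assumed): credit an item only when the judge's quoted
    evidence appears verbatim in the evaluated text.

    judgments maps item_id -> (satisfied, evidence_quote).
    Returns the weighted fraction of credited items, in [0, 1].
    """
    total = sum(item.weight for item in items)
    earned = 0.0
    for item in items:
        satisfied, quote = judgments.get(item.item_id, (False, ""))
        if satisfied and quote and quote in text:
            earned += item.weight
    return earned / total if total else 0.0


def calibrate(raw, cut_points):
    """Stage 3 (assumed): map a raw fraction onto a 1-based integer scale
    using sorted cut points, e.g. fitted on human-scored examples."""
    return 1 + bisect_right(cut_points, raw)
```

In this sketch, a judgment whose quote cannot be found in the text earns no credit even if the judge marked the item as satisfied, which is one simple way to read "evidence grounding". The cut points passed to `calibrate` are placeholders; how the paper actually calibrates scores is not described in the summary above.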