AI · Neutral · arXiv – CS AI · 7h ago · 6/10
PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges
Researchers introduce PReMISE, a framework for auditing and improving the rubrics that LLM judges use to evaluate open-ended responses. The work finds that existing rubrics, whether raw or human-created, fail to achieve reliability, preference alignment, and adversarial robustness at the same time, which has implications for how AI systems measure quality at scale.