PReMISE: Policy Rubrics as Measurement Specifications for LLM Judges
Researchers introduce PReMISE, a framework for auditing and improving rubrics used by LLM judges to evaluate open-ended responses. The work reveals that existing rubrics—whether raw or human-created—fail to simultaneously achieve reliability, preference alignment, and adversarial robustness, with implications for how AI systems measure quality at scale.
The reliability of LLM-based evaluation systems hinges on the quality of their rubrics, yet this critical component has received limited systematic scrutiny. PReMISE addresses a fundamental problem in AI evaluation: vague or poorly specified rubrics allow LLM judges to reward superficially polished but factually incorrect responses, so automated scores drift away from actual user preferences. This matters because LLM judges increasingly serve as gatekeepers for model selection, content moderation, and quality assurance across applications.
The research builds on growing concerns about automation bias in AI evaluation. As organizations scale their operations, human review becomes impractical, forcing reliance on automated metrics. However, the paper demonstrates that no single rubric source simultaneously optimizes for structural adequacy, reliability, preference fit, and adversarial robustness—a finding that challenges the assumption that high inter-rater agreement ensures robust evaluation.
The practical impact extends across the AI development pipeline. Teams training or selecting models depend on accurate feedback signals; flawed rubrics introduce systematic biases that compound through development cycles. The paper proposes two repair operations, and each delivers a measurable improvement: preference-rank selection raises judge accuracy to 68.6%, and reliability-constrained refinement cuts the share of exploitable responses that receive high scores from 46.4% to 36.0%.
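The summary does not spell out how preference-rank selection is implemented. One plausible reading, offered only as a minimal sketch, is to rank candidate rubrics by how often a judge using each one scores the human-preferred response above the rejected one, then keep the top-ranked rubric. Every name below (`pairwise_accuracy`, `select_rubric`, the two toy rubrics) and all of the data are hypothetical and are not taken from the paper; a real judge would call an LLM rather than a hand-written function.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a "judge" is any function that maps a response to a numeric
# score under one rubric. These toy judges stand in for LLM calls.
Judge = Callable[[str], float]
PreferencePair = Tuple[str, str]  # (human-preferred response, rejected response)


def pairwise_accuracy(judge: Judge, pairs: List[PreferencePair]) -> float:
    """Fraction of pairs where the judge scores the preferred response strictly higher."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    correct = sum(1 for chosen, rejected in pairs if judge(chosen) > judge(rejected))
    return correct / len(pairs)


def select_rubric(judges: Dict[str, Judge], pairs: List[PreferencePair]) -> Tuple[str, float]:
    """Return the rubric name with the highest pairwise accuracy, plus that accuracy."""
    scored = {name: pairwise_accuracy(judge, pairs) for name, judge in judges.items()}
    best = max(scored, key=scored.get)
    return best, scored[best]


# Hypothetical rubric 1: rewards length (polish over substance).
def length_rubric(response: str) -> float:
    return float(len(response))


# Hypothetical rubric 2: rewards the correct answer to "what is 2 + 2?".
def correctness_rubric(response: str) -> float:
    return 1.0 if "4" in response else 0.0


pairs = [
    ("2 + 2 = 4", "2 + 2 is, after careful and thorough reflection, 5"),
    ("The answer is 4.", "It is 5."),
]

best_name, best_acc = select_rubric(
    {"length": length_rubric, "correctness": correctness_rubric}, pairs
)
print(best_name, best_acc)
```

In this toy setup the length-based rubric is fooled by the long but wrong answer in the first pair, so the correctness-based rubric is selected. The same ranking idea would apply to any pool of candidate rubrics with held-out preference data.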
Looking forward, the framework's applicability to cross-domain rubric transfer and its effectiveness in preventing adversarial gaming remain open questions. Organizations deploying LLM judges should audit their rubrics through PReMISE's four-axis framework rather than assuming inter-rater agreement indicates robustness. The work suggests that measurement specification—treating rubrics as formal artifacts deserving rigorous engineering—will become essential infrastructure for reliable AI evaluation.
- No existing rubric source simultaneously achieves reliability, preference alignment, and adversarial robustness in LLM evaluation.
- High inter-rater agreement does not guarantee that a rubric resists adversarial exploitation or detects poor-quality responses.
- PReMISE's preference-rank selection improves judge accuracy to 68.6%, competitive with the strongest baseline methods.
- Reliability-constrained refinement reduces the rate of exploitable responses receiving high scores from 46.4% to 36.0%.
- Treating rubrics as formal measurement specifications rather than informal guidelines is critical for scalable AI evaluation.
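
To make the distinction between agreement and robustness concrete, the sketch below computes the two quantities separately on toy data: how often two judges give the same score, and what share of known-bad responses still receive a high score. The function names, the score threshold of 4, and the scores themselves are illustrative assumptions; the summary does not give the paper's exact metric definitions. A small helper also works through the arithmetic behind the reported drop from 46.4% to 36.0%.

```python
from typing import List


def agreement_rate(scores_a: List[int], scores_b: List[int]) -> float:
    """Share of items on which two judges give exactly the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum(a == b for a, b in zip(scores_a, scores_b)) / len(scores_a)


def exploit_rate(scores: List[int], is_adversarial: List[bool], high: int = 4) -> float:
    """Share of adversarial (known-bad) responses scored at or above `high`."""
    adversarial_scores = [s for s, flag in zip(scores, is_adversarial) if flag]
    if not adversarial_scores:
        raise ValueError("need at least one adversarial response")
    return sum(s >= high for s in adversarial_scores) / len(adversarial_scores)


def relative_reduction(before: float, after: float) -> float:
    """Relative drop from `before` to `after`, as a fraction of `before`."""
    return (before - after) / before


# Toy data: two judges that agree on every item (scores on a 1-5 scale).
judge_a = [5, 4, 2, 5, 1, 4]
judge_b = [5, 4, 2, 5, 1, 4]
is_adversarial = [False, True, True, True, True, False]

agreement = agreement_rate(judge_a, judge_b)        # perfect agreement
exploited = exploit_rate(judge_a, is_adversarial)   # half of the bad responses score high

# Reported figures: 46.4% -> 36.0% is a 10.4-point drop, about 22% relative.
drop = relative_reduction(46.4, 36.0)
print(agreement, exploited, round(drop, 3))
```

Here the two judges agree on every item, yet half of the adversarial responses still clear the high-score threshold, which is the kind of gap an agreement-only audit would miss.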