Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
Researchers demonstrate that LLM-based safety judges for AI agents fail a critical reliability test: their verdicts depend on how evaluation policies are worded rather than on what agents actually do. Up to 9.1% of safety judgments flip when a policy is rewritten with identical meaning, undermining the trustworthiness of current AI safety benchmarks.
Current AI safety evaluation relies heavily on large language models acting as judges to assess whether autonomous agents behave safely. This research exposes a fundamental flaw in that approach: the judges conflate agent behavior with evaluator prompting artifacts. The study tested four judge models against trajectories from established benchmarks and found that semantically identical policy rewrites and trivial structural changes produced verdict flips comparable in magnitude to those caused by meaningful policy shifts. In other words, the judges cannot distinguish substantive normative changes from meaningless prompt variations.
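The core measurement reduces to a paraphrase-consistency check. The sketch below is a minimal illustration, not the authors' code: it assumes a judge exposed as a callable `judge(policy_text, trajectory)` returning a binary verdict, and a pre-generated set of meaning-preserving rewrites of the policy.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: (policy_text, trajectory) -> "safe" | "unsafe"
Judge = Callable[[str, str], str]

def flip_rate(judge: Judge, policy: str, paraphrases: Sequence[str],
              trajectories: Sequence[str]) -> float:
    """Fraction of (trajectory, paraphrase) pairs whose verdict differs
    from the verdict under the original policy wording."""
    flips = total = 0
    for traj in trajectories:
        baseline = judge(policy, traj)        # verdict under original wording
        for alt in paraphrases:
            total += 1
            if judge(alt, traj) != baseline:  # same meaning, different verdict
                flips += 1
    return flips / total if total else 0.0
```

Because every rewrite preserves the policy's meaning, any nonzero flip rate is attributable to the judge's prompt sensitivity rather than to anything the agent did.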
This problem stems from treating LLM verdicts as ground-truth proxies without verification. Safety evaluation pipelines have proliferated as agents grow more complex, but the pipelines' own reliability has never been stress-tested. The research operationalizes three testable principles (rubric-semantics invariance, rubric-threshold invariance, and ambiguity-aware calibration) to expose this vulnerability systematically. Critically, 18-43% of observed verdict flips occur on genuinely unambiguous cases, meaning current safety scores conflate what the agent did with how the evaluator was prompted.
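The 18-43% statistic can be recovered from flip logs with a simple filter. A minimal sketch, assuming each evaluated case carries a human-assigned ambiguity label; the record schema here is hypothetical, not the authors':

```python
def unambiguous_flip_share(flip_records: list[dict]) -> float:
    """Share of observed verdict flips that occur on cases labeled
    unambiguous. Each record is assumed (illustratively) to look like
    {"flipped": bool, "ambiguous": bool}.

    A high share means the judge is unstable even where the correct
    verdict is not in dispute; the study reports 18-43%.
    """
    flips = [r for r in flip_records if r["flipped"]]
    if not flips:
        return 0.0
    return sum(not r["ambiguous"] for r in flips) / len(flips)
```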
For the AI safety ecosystem, the implications are profound. Organizations building agents for high-stakes domains such as autonomous systems, content moderation, and financial operations may be deploying models validated by unreliable judges. The order-of-magnitude spread in reliability the study identifies across judge models would be invisible on accuracy-only leaderboards. The authors contribute concrete remedies: the Policy Invariance Score and the Judge Card reporting protocol let future benchmarks audit evaluator reliability rather than assume it by default. This work signals a necessary maturation of AI safety evaluation, shifting from blind trust in LLM judges toward systematic reliability testing before deployment.
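The paper's exact scoring formula and card schema are not reproduced here; the sketch below assumes the simplest plausible convention, a Policy Invariance Score (PIS) of 1 minus the flip rate per perturbation type, bundled into a Judge Card alongside the conventional accuracy number. All field names and values are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class JudgeCard:
    """Hypothetical reporting structure in the spirit of the paper's
    Judge Card protocol: one reliability summary per judge model."""
    judge_model: str
    accuracy: float  # the conventional leaderboard number
    policy_invariance: dict[str, float] = field(default_factory=dict)

    def add_perturbation(self, kind: str, flip_rate: float) -> None:
        # Assumed convention: PIS = 1 - flip rate, so a perfectly
        # wording-insensitive judge scores 1.0 on every perturbation.
        self.policy_invariance[kind] = 1.0 - flip_rate

    def summary(self) -> str:
        lines = [f"Judge: {self.judge_model}  accuracy={self.accuracy:.3f}"]
        lines += [f"  PIS[{kind}] = {score:.3f}"
                  for kind, score in sorted(self.policy_invariance.items())]
        return "\n".join(lines)

card = JudgeCard("example-judge-v1", accuracy=0.91)          # illustrative values
card.add_perturbation("semantic_rewrite", flip_rate=0.091)   # the study's worst case
card.add_perturbation("structural_change", flip_rate=0.050)  # made-up example
print(card.summary())
```

Reporting per-perturbation scores rather than a single average is what keeps the order-of-magnitude spread across judge models visible instead of washing it out.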
- LLM safety judges flip up to 9.1% of verdicts on semantically identical policy rewrites, conflating agent behavior with evaluator prompting.
- Current safety benchmarks treat judge verdicts as ground truth without verifying that the judges measure agent behavior rather than prompt sensitivity.
- 18-43% of verdict flips under content-preserving rewrites occur on unambiguous cases, showing judges lack robustness even on clear-cut scenarios.
- The Policy Invariance Score and Judge Card protocol enable transparent auditing of evaluator reliability, exposing an order-of-magnitude spread that accuracy metrics miss.
- Existing AI safety scores may systematically overestimate agent reliability because they rely on unreliable evaluation judges.