AIBearish · arXiv – CS AI · 6h ago · 7/10
🧠
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
Researchers demonstrate that LLM-based safety judges for AI agents fail a critical reliability test: their verdicts depend on how evaluation policies are worded rather than on what agents actually do. The study finds that up to 9.1% of safety judgments flip when policies are rewritten with identical meaning, undermining the trustworthiness of current AI safety benchmarks.
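The invariance check boils down to comparing a judge's verdict under the original policy wording against its verdicts under meaning-preserving paraphrases and counting disagreements. A minimal sketch of that flip-rate computation is below; `judge_verdict`, `paraphrased_policies`, and `transcripts` are hypothetical names for illustration, not the paper's actual code or API.

```python
# Sketch: estimate how often an LLM safety judge's verdict flips when the
# evaluation policy is reworded without changing its meaning.
# `judge_verdict` is a stand-in for any judge call that maps
# (policy_text, agent_transcript) -> "safe" | "unsafe".

from typing import Callable, Sequence


def flip_rate(
    judge_verdict: Callable[[str, str], str],
    original_policy: str,
    paraphrased_policies: Sequence[str],
    transcripts: Sequence[str],
) -> float:
    """Fraction of (paraphrase, transcript) pairs whose verdict differs
    from the verdict obtained under the original policy wording."""
    flips = 0
    total = 0
    for transcript in transcripts:
        baseline = judge_verdict(original_policy, transcript)
        for policy in paraphrased_policies:
            total += 1
            if judge_verdict(policy, transcript) != baseline:
                flips += 1
    return flips / total if total else 0.0
```

A perfectly policy-invariant judge would score 0.0 here; the reported figure of up to 9.1% flipped judgments is what this kind of metric would surface.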