Researchers identified a critical failure mode in LLM-based agents called policy-invisible violations, where agents execute actions that appear compliant but breach organizational policies due to missing contextual information. They introduced PhantomPolicy, a benchmark with 600 test cases, and Sentinel, an enforcement framework using counterfactual graph simulation that achieved 93% accuracy in detecting violations compared to 68.8% for baseline approaches.
LLM-based agents represent a significant shift in how organizations automate decision-making, yet this research exposes a fundamental vulnerability in their deployment. Policy-invisible violations occur when agents lack access to critical contextual facts needed to judge compliance—such as user history, entity attributes, or organizational state—despite having syntactically valid and semantically appropriate instructions. This distinction matters because it reveals that traditional safety approaches focusing on language content and intent validation miss an entire category of harmful behaviors.
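The failure mode can be made concrete with a small sketch. The refund scenario, entity names, and thresholds below are illustrative assumptions, not examples from the paper; the point is only that the agent's decision is locally correct given what it can see, while the real policy depends on state it cannot see.

```python
from dataclasses import dataclass

@dataclass
class RefundRequest:
    user_id: str
    amount: float

def agent_decision(req: RefundRequest) -> str:
    # Policy as visible to the agent: refunds under $100 auto-approve.
    return "approve" if req.amount < 100 else "escalate"

# Organizational state the agent never sees: per-user refund counts.
REFUND_HISTORY = {"u42": 5}  # hypothetical user with 5 refunds this month

def true_compliance(req: RefundRequest) -> bool:
    # Actual policy: at most 3 refunds per user per month.
    return REFUND_HISTORY.get(req.user_id, 0) < 3

req = RefundRequest(user_id="u42", amount=50.0)
print(agent_decision(req))   # "approve" -- syntactically valid, looks compliant
print(true_compliance(req))  # False -- a policy-invisible violation
```

Content- or intent-based filters would pass this request, since nothing in the instruction itself is harmful; the violation exists only in the joint state of the action and the hidden history.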
The research addresses a gap between laboratory AI safety work and production requirements. Most LLM safety research concentrates on preventing harmful outputs or misaligned instructions, but this work demonstrates that harm can arise from incomplete information states rather than malicious intent. The PhantomPolicy benchmark spanning eight violation categories provides empirical grounding, while the manual review finding that 5.3% of labels changed upon trace-level inspection underscores how easily automated evaluation can miss nuanced failures.
Sentinel's counterfactual graph simulation approach represents an architectural shift toward world-state-grounded enforcement. By treating agent actions as mutations to an organizational knowledge graph and verifying structural invariants against the resulting state, it achieves substantially higher accuracy than content-only detection. This framework suggests that enterprise AI deployment requires embedding agents within richer information ecosystems where policy enforcement has visibility into both actions and their consequences.
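The mechanism described above can be sketched as follows. This is a minimal illustration of the general idea, assuming a dict-based knowledge graph, a single hypothetical invariant, and a made-up document-sharing action; none of these names or structures come from the Sentinel paper.

```python
import copy

# Organizational knowledge graph: entity -> attributes (illustrative).
graph = {
    "doc:quarterly-report": {"classification": "confidential",
                             "shared_with": {"alice"}},
    "user:alice": {"clearance": "confidential"},
    "user:bob": {"clearance": "public"},
}

def no_confidential_leak(g):
    # Structural invariant: confidential documents may only be shared
    # with users whose clearance is "confidential".
    for attrs in g.values():
        if attrs.get("classification") == "confidential":
            for user in attrs.get("shared_with", set()):
                if g.get(f"user:{user}", {}).get("clearance") != "confidential":
                    return False
    return True

INVARIANTS = [no_confidential_leak]

def simulate_share(g, doc, user):
    # Counterfactual step: apply the proposed action to a *copy*
    # of the graph, leaving the real state untouched.
    candidate = copy.deepcopy(g)
    candidate[doc]["shared_with"].add(user)
    return candidate

def enforce(g, doc, user):
    # Allow the action only if every invariant still holds
    # in the simulated post-action state.
    candidate = simulate_share(g, doc, user)
    return "allow" if all(inv(candidate) for inv in INVARIANTS) else "block"

print(enforce(graph, "doc:quarterly-report", "bob"))    # "block"
print(enforce(graph, "doc:quarterly-report", "alice"))  # "allow"
```

The key design choice is that enforcement judges the simulated world state rather than the action's surface content: sharing a document with "bob" is blocked not because the instruction reads as harmful, but because the post-action graph would violate an invariant.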
The implications extend beyond academic interest. Organizations deploying autonomous agents in high-stakes domains—finance, healthcare, access control—face serious liability if policy violations occur despite good-faith agent behavior. This research indicates that effective governance requires not just better prompting or training, but architectural changes ensuring enforcement systems access the same contextual information agents use for decisions.
- Policy-invisible violations occur when agents lack access to contextual information needed for compliance judgment, despite technically correct behavior.
- Manual trace-level review changed 5.3% of labels, indicating automated evaluation significantly underestimates violation risks in LLM agents.
- Sentinel's world-state-aware enforcement achieved 93% accuracy versus 68.8% for content-only baselines, demonstrating the value of contextual information.
- Enterprise AI governance requires architectural changes embedding enforcement systems within organizational knowledge graphs, not just improved prompting.
- This research identifies a critical gap between AI safety research and production deployment requirements for autonomous agent systems.