AgentTrust: A Self-Improving Trust Layer for AI-Agent Actions
AgentTrust v2 introduces a self-improving trust layer for AI agents that distinguishes between lexical (rule-detectable) and semantic (intent-dependent) threats. Using an LLM judge combined with a dual-store system, it achieves 83.6-85.2% accuracy on semantic threats while progressively distilling deterministic rules for lexical threats, demonstrating zero false-blocks across 45,000 test actions.
AgentTrust addresses a critical infrastructure gap in autonomous AI systems. As agents gain permission to execute shell commands, cloud operations, and tool calls, deciding which actions to allow becomes a security-critical problem. The research distinguishes between two fundamentally different threat classes: lexical threats (malicious code with recognizable signatures) and semantic threats (benign-looking actions with malicious intent). This taxonomy explains why static rule-based systems fall short: rules match surface patterns and cannot see intent, so hand-authored cloud rules improved accuracy only marginally (48% to 56%) while missing entire threat categories.
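The lexical/semantic split can be illustrated with a minimal sketch. The patterns, the function name `lexical_verdict`, and the example commands are hypothetical and chosen only for illustration; they are not taken from AgentTrust's rule packs.

```python
import re

# Hypothetical signature rules (illustrative only, not AgentTrust's rules).
LEXICAL_RULES = [
    re.compile(r"\brm\s+-rf\s+/(\s|$)"),          # recursive delete of the root directory
    re.compile(r"curl\s+[^|]+\|\s*(ba)?sh\b"),    # piping a download straight into a shell
]


def lexical_verdict(command: str) -> str:
    """Return 'block' if a signature matches, otherwise 'unknown'."""
    if any(rule.search(command) for rule in LEXICAL_RULES):
        return "block"
    return "unknown"


# Lexical threat: the text of the command is enough to flag it.
print(lexical_verdict("rm -rf /"))

# Semantic case: the same command is a routine backup or an exfiltration
# depending on who owns the destination bucket, which the pattern cannot see.
sync_cmd = "aws s3 sync s3://customer-data s3://other-bucket"
print(lexical_verdict(sync_cmd))
```

The second call returns `unknown` whatever the intent behind it, which is the gap the summary attributes to static rules: no amount of pattern authoring adds the missing context.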
The dual-store architecture represents a practical advance in AI security. An LLM judge evaluates each action and, critically, self-improves by distilling successful decisions into cheaper deterministic rules while building a guarded precedent store for semantic cases. The corroboration guard, which prevents surface-identical actions from collapsing to cached verdicts, lifted semantic accuracy by 13 percentage points, showing that precedent-based systems need contextual verification before reusing a verdict. The online replay results support the approach: the judge call rate dropped from 50% to 44% (reducing cost and latency), semantic accuracy improved from 71% to 80%, and zero benign actions were hard-blocked across 45,000 test cases.
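The control flow described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the AgentTrust implementation: the class `TrustLayer`, the judge signature (returning a verdict plus a flag saying whether the decision is lexical), the exact-match context check standing in for the corroboration guard, and `toy_judge` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class TrustLayer:
    # Hypothetical judge: (action, context) -> (verdict, is_lexical).
    judge: Callable[[str, str], Tuple[str, bool]]
    # Store 1: deterministic rules distilled from lexical decisions.
    rules: Dict[str, str] = field(default_factory=dict)
    # Store 2: precedents for semantic decisions, keyed by action, then context.
    precedents: Dict[str, Dict[str, str]] = field(default_factory=dict)
    judge_calls: int = 0

    def decide(self, action: str, context: str) -> str:
        # Cheapest path: a distilled rule answers without looking at context.
        if action in self.rules:
            return self.rules[action]
        # Guarded precedent: reuse a verdict only if the context corroborates it,
        # so surface-identical actions do not collapse to one cached answer.
        by_context = self.precedents.get(action, {})
        if context in by_context:
            return by_context[context]
        # Fall back to the (expensive) judge and learn from its decision.
        self.judge_calls += 1
        verdict, is_lexical = self.judge(action, context)
        if is_lexical:
            self.rules[action] = verdict
        else:
            self.precedents.setdefault(action, {})[context] = verdict
        return verdict


def toy_judge(action: str, context: str) -> Tuple[str, bool]:
    """Stand-in for an LLM judge; the conditions are illustrative only."""
    if "rm -rf /" in action:
        return "block", True           # decidable from the text alone
    if "external" in context:
        return "block", False          # depends on intent and context
    return "allow", False


layer = TrustLayer(judge=toy_judge)
sync = "aws s3 sync s3://data s3://dest"
layer.decide("rm -rf /", "cleanup")                  # judge called, rule distilled
layer.decide("rm -rf /", "anything")                 # answered by the rule
layer.decide(sync, "nightly backup")                 # judge called, precedent stored
layer.decide(sync, "nightly backup")                 # answered by the precedent
layer.decide(sync, "copy to external account")       # guard fails, judge called again
print(layer.judge_calls)
```

In this toy run the judge is consulted three times for five decisions. The last call is the important one: without the context check, the cached "allow" for the backup would have been reused for the same command issued with a different purpose.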
For AI infrastructure, this work makes the case that security layers for autonomous agents need learned components, not just rule engines. The self-improving mechanism means a trust system becomes more efficient over time rather than remaining static. However, the research remains academic; real-world deployment would require integration testing with production agent systems and adversarial evaluation against sophisticated semantic attacks designed to defeat the LLM judge.
- AgentTrust v2 uses LLM judges to detect semantic threats (intent-based attacks) where traditional rules fail, achieving 83.6-85.2% accuracy.
- A self-improving dual-store distills deterministic rules from lexical threats while building a guarded precedent cache for semantic threats.
- The system demonstrated zero false-blocks (benign actions hard-blocked) across 45,000 production-replay actions, critical for user trust.
- The judge call rate fell 6 percentage points (50% to 44%, a 12% relative reduction) while semantic accuracy rose 9 percentage points (71% to 80%), indicating improving efficiency and capability over time.
- The research argues that intent-dependent threat detection requires learned models: static rule packs match surface patterns and cannot close the semantic threat gap.