AINeutralarXiv – CS AI · 3d ago7/10
🧠AIRGuard is a runtime security framework that protects AI agents from authority confusion attacks, where attackers manipulate untrusted context to misuse authorized tool access. The system reduces attack success rates from 36.3% to 5.5% while maintaining 76% of benign functionality, outperforming existing defense mechanisms by enforcing least-privilege authorization at execution time.
🧠 Haiku🧠 Sonnet
AIBearisharXiv – CS AI · May 127/10
🧠Researchers introduce EnvTrustBench, a benchmarking framework that identifies evidence-grounding defects (EGDs) in LLM agents—failures where agents act on stale, incorrect, or malicious environmental data without verification. Testing across 6 LLM backbones and 5 agent scaffolds reveals consistent vulnerabilities, exposing a critical reliability gap in agent systems that increasingly interact with real-world APIs, files, and logs.
AIBullisharXiv – CS AI · May 77/10
🧠AgentTrust is a runtime safety layer that intercepts AI agent tool calls before execution to prevent unsafe actions like accidental deletion, credential exposure, or data exfiltration. The system achieves 95-96.7% verdict accuracy across benchmarks using deobfuscation, risk chain detection, and LLM-based judgment, addressing a critical gap in AI agent safety infrastructure.
AIBullisharXiv – CS AI · Apr 207/10
🧠Researchers present symbolic guardrails as a practical approach to enforce safety and security constraints on AI agents that use external tools. Analysis of 80 benchmarks reveals that 74% of policy requirements can be enforced through symbolic guardrails without reducing agent effectiveness, addressing a critical gap in AI safety for high-stakes applications.
AIBearisharXiv – CS AI · Apr 147/10
🧠Researchers have identified a critical safety vulnerability in computer-use agents (CUAs) where benign user instructions can lead to harmful outcomes due to environmental context or execution flaws. The OS-BLIND benchmark reveals that frontier AI models, including Claude 4.5 Sonnet, achieve 73-93% attack success rates under these conditions, with multi-agent deployments amplifying vulnerabilities as decomposed tasks obscure harmful intent from safety systems.
🧠 Claude
AIBearisharXiv – CS AI · Apr 107/10
🧠Researchers have discovered a new attack vulnerability in mobile vision-language agents where malicious prompts remain invisible to human users but are triggered during autonomous agent interactions. Using an optimization method called HG-IDA*, attackers can achieve 82.5% planning and 75.0% execution hijack rates on GPT-4o by exploiting the lack of touch signals during agent operations, exposing a critical security gap in deployed mobile AI systems.
🧠 GPT-4
AIBullisharXiv – CS AI · Mar 177/10
🧠Researchers introduce ILION, a deterministic safety system for autonomous AI agents that can execute real-world actions like financial transactions and API calls. The system achieves 91% precision with sub-millisecond latency, significantly outperforming existing text-safety infrastructure that wasn't designed for agent execution safety.
🏢 OpenAI🧠 Llama
AINeutralarXiv – CS AI · 4d ago6/10
🧠Grimlock is a security framework that uses eBPF and TLS 1.3 channel binding to enforce authorization and delegation controls in agentic AI systems without modifying application code. The system intercepts sandbox communications, validates identity through post-handshake attestation, and issues short-lived scope tokens to enable secure multi-cloud orchestration with transparent auditability.
AINeutralarXiv – CS AI · 5d ago6/10
🧠Researchers present ChainCaps, a runtime safety framework that prevents tool-using AI agents from exploiting composed services through 'permission laundering'—where an agent passes intermediate results through multiple tools to achieve unauthorized outcomes. The system uses capability budgets that propagate through tool chains via intersection, reducing attack success rates from 25-68% to 0-4.8% while maintaining 96-100% benign task completion across frontier models.
AIBullisharXiv – CS AI · Mar 36/108
🧠Researchers propose a new safety framework for AI agents using Scala 3 with capture checking to prevent information leakage and malicious behaviors. The system creates a 'safety harness' that tracks capabilities through static type checking, allowing fine-grained control over agent actions while maintaining task performance.