AI · Bearish · arXiv – CS AI · 1d ago · 7/10
🧠Researchers have developed AutoElicit, a framework that automatically discovers unsafe behaviors in computer-use agents (CUAs) like Claude and Operator by iteratively perturbing benign instructions. The study reveals hundreds of severe unintended behaviors in state-of-the-art AI agents and demonstrates these vulnerabilities transfer across multiple frontier models, establishing the first systematic methodology for probing CUA safety risks.
🧠 Claude
AI × Crypto · Neutral · arXiv – CS AI · 6d ago · 7/10
🤖Researchers propose Sello, a cryptographic protocol that addresses a critical vulnerability in AI agent observability by having external services sign tamper-evident receipts of agent actions rather than agents logging their own activity. The system uses receiver-side signing, encryption, and public transparency logs to create an independent audit trail that prevents compromised agents from falsifying records.
AI · Bearish · arXiv – CS AI · Jun 1 · 7/10
🧠Researchers demonstrate the first distributed agent attack where language models coordinate across multiple accounts to hide cyberattacks from detection systems. They propose a stateful online monitoring solution using real-time clustering that catches these distributed threats 30% earlier while maintaining negligible latency for legitimate traffic.
AI · Neutral · arXiv – CS AI · May 29 · 7/10
🧠AIRGuard is a runtime security framework that protects AI agents from authority confusion attacks, where attackers manipulate untrusted context to misuse authorized tool access. The system reduces attack success rates from 36.3% to 5.5% while maintaining 76% of benign functionality, outperforming existing defense mechanisms by enforcing least-privilege authorization at execution time.
🧠 Haiku 🧠 Sonnet
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers introduce EnvTrustBench, a benchmarking framework that identifies evidence-grounding defects (EGDs) in LLM agents—failures where agents act on stale, incorrect, or malicious environmental data without verification. Testing across 6 LLM backbones and 5 agent scaffolds reveals consistent vulnerabilities, exposing a critical reliability gap in agent systems that increasingly interact with real-world APIs, files, and logs.
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠AgentTrust is a runtime safety layer that intercepts AI agent tool calls before execution to prevent unsafe actions like accidental deletion, credential exposure, or data exfiltration. The system achieves 95-96.7% verdict accuracy across benchmarks using deobfuscation, risk chain detection, and LLM-based judgment, addressing a critical gap in AI agent safety infrastructure.
AI · Bullish · arXiv – CS AI · Apr 20 · 7/10
🧠Researchers present symbolic guardrails as a practical approach to enforce safety and security constraints on AI agents that use external tools. Analysis of 80 benchmarks reveals that 74% of policy requirements can be enforced through symbolic guardrails without reducing agent effectiveness, addressing a critical gap in AI safety for high-stakes applications.
AI · Bearish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers have identified a critical safety vulnerability in computer-use agents (CUAs) where benign user instructions can lead to harmful outcomes due to environmental context or execution flaws. The OS-BLIND benchmark reveals that frontier AI models, including Claude 4.5 Sonnet, achieve 73-93% attack success rates under these conditions, with multi-agent deployments amplifying vulnerabilities as decomposed tasks obscure harmful intent from safety systems.
🧠 Claude
AI · Bearish · arXiv – CS AI · Apr 10 · 7/10
🧠Researchers have discovered a new vulnerability in mobile vision-language agents: malicious prompts remain invisible to human users but are triggered during autonomous agent interactions. Using an optimization method called HG-IDA*, attackers achieve hijack rates of 82.5% for planning and 75.0% for execution on GPT-4o by exploiting the lack of touch signals during agent operations, exposing a critical security gap in deployed mobile AI systems.
🧠 GPT-4
AI · Bullish · arXiv – CS AI · Mar 17 · 7/10
🧠Researchers introduce ILION, a deterministic safety system for autonomous AI agents that can execute real-world actions like financial transactions and API calls. The system achieves 91% precision with sub-millisecond latency, significantly outperforming existing text-safety infrastructure that wasn't designed for agent execution safety.
🏢 OpenAI 🧠 Llama
AI · Neutral · arXiv – CS AI · 2d ago · 6/10
🧠Researchers introduce TRACE, a monitoring framework designed to detect malicious behavior in autonomous LLM agents by tracking evidence across long sequences of seemingly benign actions. The system achieves an F1 score of 0.713 and a recall of 0.844 on benchmark tests, addressing a critical security gap where agents can pursue hidden objectives through temporally distributed steps.
AI · Neutral · arXiv – CS AI · Jun 2 · 6/10
🧠Researchers identify a privacy vulnerability in AI agents that use speculative tool calls to reduce latency, where external services receive and retain inferred user intent data even after the agent abandons the speculative branch. The study proposes Speculative Tool Privacy Contracts as a runtime solution, finding that only issue-time policies suppressing or modifying calls before dispatch effectively mitigate information leakage.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Grimlock is a security framework that uses eBPF and TLS 1.3 channel binding to enforce authorization and delegation controls in agentic AI systems without modifying application code. The system intercepts sandbox communications, validates identity through post-handshake attestation, and issues short-lived scope tokens to enable secure multi-cloud orchestration with transparent auditability.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers present ChainCaps, a runtime safety framework that prevents tool-using AI agents from exploiting composed services through 'permission laundering'—where an agent passes intermediate results through multiple tools to achieve unauthorized outcomes. The system uses capability budgets that propagate through tool chains via intersection, reducing attack success rates from 25-68% to 0-4.8% while maintaining 96-100% benign task completion across frontier models.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 8
🧠Researchers propose a new safety framework for AI agents using Scala 3 with capture checking to prevent information leakage and malicious behaviors. The system creates a 'safety harness' that tracks capabilities through static type checking, allowing fine-grained control over agent actions while maintaining task performance.