#prompt-injection News & Analysis

113 articles tagged with #prompt-injection. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

113 articles

AIBearisharXiv – CS AI · Jun 257/10

🧠

AI Snitches Get Glitches: Towards Evading Agentic Surveillance

Researchers introduce 'agentic surveillance'—the ability of AI agents to analyze data and send reports about users without consent—and create SurveilBench to evaluate this risk across models. The study demonstrates that surveillance can already be easily implemented while also developing prompt injection-based evasion techniques, raising urgent calls for technical and legislative safeguards.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Safe to Check, Unsafe to Use: Relinking at the Compression Boundary of LLM Agents

Researchers have identified a critical vulnerability called "relinking" in LLM agents that use compression to handle long contexts. By splitting malicious instructions into benign fragments distributed across text, attackers can bypass security filters that inspect uncompressed prompts, as the compression process reconstructs the complete malicious instruction. Existing defenses fail to catch this attack, though a new KBRA defense eliminates the risk.

AIBearisharXiv – CS AI · Jun 237/10

🧠

MIRAGE: Stealthy Visual Prompt Injection for Vulnerability Detection in Web Agents

Researchers have identified a sophisticated vulnerability in multimodal AI web agents through MIRAGE, a visual prompt injection attack that exploits trusted web platforms by embedding hidden adversarial instructions within legitimate ad slots or widgets. The attack demonstrates how constrained attackers can manipulate MLLM-based automation tools like SeeAct and OpenClaw without detection, raising critical security concerns for AI-powered browser automation systems.

AIBearisharXiv – CS AI · Jun 237/10

🧠

Confidently Wrong: Severity-Aware Calibration of Prompt-Injection Detectors under Attack Shift

Researchers discovered that popular prompt-injection detectors (ProtectAI-v2 and Prompt-Guard-2) maintain extremely high confidence scores even when failing to catch attacks, particularly indirect behavior-hijack injections. Across multiple attack distribution shifts, detectors missed injections with 0.99-1.00 confidence while false-negative rates ranged from 1-97%, indicating a critical calibration failure that standard metrics fail to detect.

AIBearishSimon Willison Blog · Jun 227/10

🧠

Prompt Injection as Role Confusion

The article examines prompt injection attacks as a form of role confusion in AI systems, where malicious inputs manipulate language models into bypassing their intended constraints by exploiting how these models interpret conflicting instructions and contextual switching.

AIBearisharXiv – CS AI · Jun 197/10

🧠

Analyzing the Narration Gap in LLM-Solver Loops

Researchers identify critical vulnerabilities in LLM-solver hybrid systems where formal verification guarantees break down during the narration phase—converting solver outputs to user-readable answers. Testing five open-source models reveals adversaries can manipulate final responses through prompt injection despite underlying formal correctness, indicating safety-critical applications using AI-assisted reasoning require additional safeguards beyond solver verification.

AINeutralarXiv – CS AI · Jun 197/10

🧠

Analyzing Defensive Misdirection Against Model-Guided Automated Attacks on Agentic AI Systems

Researchers demonstrate that conventional detect-and-block defenses against AI jailbreak attacks fail as automated attackers scale their efforts, but a new misdirection strategy called CMPE significantly reduces attack success rates by feeding false positives to attacker judges instead of predictable refusals.

AIBearisharXiv – CS AI · Jun 117/10

🧠

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

Researchers developed AutoInject, a reinforcement learning framework that automatically generates adversarial prompts to exploit LLM agents through prompt injection attacks. The method outperforms existing attack techniques on production models and successfully breaks defenses specifically designed to resist prompt injection, highlighting a significant vulnerability gap in AI system security.

AIBearisharXiv – CS AI · Jun 107/10

🧠

GitInject: Real-World Prompt Injection Attacks in AI-Powered CI/CD Pipelines

Researchers present GitInject, a framework demonstrating prompt injection vulnerabilities in AI-powered CI/CD pipelines used by major tech companies. The study reveals that all tested AI providers are susceptible to attacks that could enable credential theft, code manipulation, and supply chain compromise through GitHub workflows.

AIBearisharXiv – CS AI · Jun 107/10

🧠

Assessing Automated Prompt Injection Attacks in Agentic Environments

Researchers have evaluated automated prompt injection attacks against large language model agents using both white-box and black-box optimization methods, finding that black-box approaches significantly outperform gradient-based techniques in realistic agentic settings. While task-universal attacks transfer effectively across domains, attacks trained on smaller models fail to generalize to frontier models like GPT-5, suggesting model-dependent vulnerabilities rather than universal exploits.

🧠 GPT-5

AIBearisharXiv – CS AI · Jun 107/10

🧠

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation

A comprehensive review of 247 research papers reveals that LLM agents face escalating security threats beyond text generation, including prompt injection, tool hijacking, and state corruption. The study proposes a framework emphasizing trust boundaries, privilege control, and stateful risk evaluation to address fragmented defenses and inadequate benchmarking standards.

AIBearisharXiv – CS AI · Jun 97/10

🧠

VATS: Exploiting Implicit Authority in Error-Path Injection via Systematic Mutation

Researchers have identified a critical vulnerability in the Model Context Protocol (MCP) used by autonomous AI agents, where error messages can be weaponized to bypass safety guardrails. The VATS framework demonstrates that error-path injection attacks triple the success rate of standard prompt injection techniques, achieving near-perfect compliance rates across leading AI models, though production-level mitigations exist.

🧠 GPT-5🧠 Gemini

AIBearisharXiv – CS AI · Jun 97/10

🧠

Hiding in Plain Floats: Steganographic Carriers for Indirect Prompt and Content Injection

Researchers demonstrated a novel prompt-injection attack that bypasses text-based LLM defenses by encoding malicious payloads as floating-point parameters and reconstructing them as fragmented telemetry. Testing across three commercial LLM APIs showed 94.3% attack success rate against leading defenses like Prompt Guard 2, revealing a critical gap in structured-input security.

AIBearisharXiv – CS AI · Jun 97/10

🧠

Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents

Researchers identify critical security vulnerabilities in brain-computer interface (BCI) systems connected to large language model agents, demonstrating that neural signal perturbations can manipulate tool-use authorization while evading standard safety monitors. The study establishes a formal audit framework to detect and mitigate 'brain-prompt injection' attacks, revealing that current decoder accuracy metrics fail to guarantee route safety in BCI-LLM pipelines.

AIBullisharXiv – CS AI · Jun 87/10

🧠

Zero-Shot Embedding Drift Detection: A Lightweight Defense Against Prompt Injections in LLMs

Researchers introduce Zero-Shot Embedding Drift Detection (ZEDD), a lightweight defense mechanism that detects prompt injection attacks on large language models by measuring semantic shifts in embedding space. The method achieves over 93% accuracy with less than 3% false positives across multiple LLM architectures without requiring model access or task-specific training.

🧠 Llama

AIBearisharXiv – CS AI · Jun 87/10

🧠

When Large Language Models Fail in Healthcare: Evaluating Sensitivity to Prompt Variations

A comprehensive study reveals that both general-purpose and medical-specific large language models exhibit dangerous sensitivity to prompt variations, with even minor rewording capable of altering clinical diagnoses or producing harmful medical advice. The research demonstrates that adversarial manipulations can trigger clinically dangerous outputs such as incorrect dosages, raising serious safety concerns for healthcare AI deployment.

🧠 Llama

AIBearisharXiv – CS AI · Jun 87/10

🧠

How reliable are LLMs when it comes to playing dice?

A comprehensive study of 8 state-of-the-art language models reveals significant limitations in probabilistic reasoning, with accuracy dropping from 96% on standard problems to 59% on counterintuitive ones. The research demonstrates that LLMs are vulnerable to token bias and prompt manipulation, suggesting they lack genuine probability reasoning despite excelling at other mathematical tasks.

AIBearisharXiv – CS AI · Jun 87/10

🧠

It's a TRAP! Task-Redirecting Agent Persuasion Benchmark for Web Agents

Researchers introduce TRAP, a benchmark demonstrating that web-based AI agents are vulnerable to prompt injection attacks hidden in interface elements, with susceptibility rates ranging from 13% to 43% across frontier models. The study reveals that small contextual changes can double attack success rates, exposing systemic security weaknesses in autonomous agents performing real-world tasks like email management and professional networking.

🧠 GPT-5

AIBearishDecrypt · Jun 67/10

🧠

Claude Code Vulnerability Could Let Attackers Steal Credentials From GitHub, Says Microsoft

Microsoft researchers have identified a critical vulnerability in Claude Code where prompt injection attacks could manipulate AI coding agents into exfiltrating sensitive credentials stored in GitHub and development pipelines. This security flaw highlights systemic risks in deploying AI agents with access to production environments and sensitive infrastructure.

🧠 Claude

AIBearisharXiv – CS AI · Jun 57/10

🧠

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

Researchers demonstrate that LLM-based judges used in AI benchmarking are highly vulnerable to manipulation through post-decision interaction, with targeted challenges capable of overturning initial evaluations despite high confidence scores. This vulnerability introduces a critical failure mode in automated evaluation systems that could degrade benchmark reliability and ranking accuracy.

AIBearisharXiv – CS AI · Jun 57/10

🧠

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

Researchers challenge the credibility of recent computer-using agent (CUA) red-teaming studies by reproducing published prompt-injection attacks against frontier models Claude Sonnet 4.6 and GPT-5.4, finding 0% success rates compared to reported 42-98% attack success rates in prior work. The analysis reveals that published high attack success rates depend on reinforcement-learning optimized injection text rather than fundamental attack categories, and that safety hardening is domain-specific to browser interfaces, not generalizable across CUA modalities.

🧠 GPT-5🧠 Claude🧠 Sonnet

AIBearisharXiv – CS AI · Jun 57/10

🧠

SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks

Researchers introduce SlotGCG, a novel jailbreak attack method that exploits positional vulnerabilities in large language models by strategically inserting adversarial tokens at optimal positions within prompts rather than just at the end. The approach achieves 14% higher success rates than existing GCG-based attacks while identifying that LLM vulnerability is significantly dependent on token insertion location.

AIBearisharXiv – CS AI · Jun 47/10

🧠

What If Prompt Injection Never Left? Exploring Cross-Session Stored Prompt Injection in Agentic Systems

Researchers have identified a critical security vulnerability in agentic AI systems called cross-session stored prompt injection, where malicious instructions can persist within system state and compromise future interactions long after the attacker disconnects. This threat fundamentally differs from traditional prompt injection by leveraging long-lived system artifacts like memories and filesystems, transforming ephemeral model-level attacks into durable system-level vulnerabilities that accumulate over time.

AIBearisharXiv – CS AI · Jun 47/10

🧠

Caught in the Act(ivation): Toward Pre-Output and Multi-Turn Detection of Credential Exfiltration by LLM Agents

Researchers demonstrate that LLM agents are vulnerable to credential exfiltration attacks when sensitive data shares context windows with untrusted content, enabling indirect prompt injection. The study proposes three defense mechanisms: activation probes for pre-output detection, honeytokens with calibrated thresholds, and multi-turn leakage accounting to prevent cumulative credential theft across conversations.

AIBearisharXiv – CS AI · Jun 47/10

🧠

From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

Researchers have identified systematic vulnerabilities in LLM-based AI agents that enable memory poisoning attacks, where adversaries inject malicious data into persistent memory to manipulate long-term agent behavior. The study reveals four memory write channels and nine structural vulnerabilities across system design, with existing security defenses proving ineffective against this threat vector.

Page 1 of 5Next →