Depth-Dependent Indirect Prompt Injection in Tool-Calling ReAct Agents: Injection Depth, Payload Framing, and Turn-Budget Sensitivity
Researchers identified that indirect prompt injection attacks against ReAct AI agents succeed at dramatically different rates depending on where malicious payloads appear in tool sequences, with success rates dropping from 60% at the first tool observation to 0% at deeper positions. The study reveals that payload framing and conversation turn limits have minimal impact on attack success, making injection depth the critical vulnerability factor for AI agent systems handling real-world tasks.
This research exposes a significant asymmetry in how modern AI agents defend against adversarial inputs, specifically in ReAct systems that combine reasoning with tool-calling. The team systematically tested indirect prompt injection—where attackers embed malicious instructions in tool responses—across multiple dimensions that prior benchmarks ignored. Their findings demonstrate that GPT-4o-mini becomes progressively resistant to injection attacks as payloads appear later in tool sequences, achieving complete failure by depth 4-5, while Claude Haiku exhibits consistent resistance regardless of injection position. This depth-dependency pattern stems from two distinct mechanisms: early resistance where models actively reject malicious instructions, and late-stage completion where agents finish their primary task before encountering the payload. The framing study, testing how rhetorical style affects injection success, showed tantalizing variation between 25% and 75% success rates but lacked statistical power to confirm significance. Turn budgets—the number of interactions allowed—showed no meaningful impact on vulnerability, suggesting agents don't accumulate susceptibility over conversation length. The practical insight that sanitizing only the first tool observation would capture two-thirds of attacks offers developers an immediate hardening strategy, though this may represent a false economy if sophisticated attackers learn to exploit deeper positions. For AI safety practitioners and companies deploying autonomous agents in production environments handling scheduling, file systems, or data access, these findings suggest injection depth is the dominant risk lever requiring careful attention during threat modeling.
- →Indirect prompt injection success against GPT-4o-mini drops from 60% at depth 1 to 0% at depths 4-5, establishing injection depth as the primary vulnerability factor.
- →Claude Haiku demonstrated consistent resistance across all injection depths, suggesting different models have fundamentally different defense mechanisms.
- →Payload framing effects varied widely (25-75% success range) but lacked statistical significance at current sample size, requiring larger studies to confirm rhetorical manipulation risk.
- →Sanitizing only the first tool observation would mitigate approximately 67% of measured injection attacks, offering developers a straightforward initial hardening approach.
- →Turn budget and conversation length showed no correlation with injection success, contradicting assumptions that extended interactions increase AI agent vulnerability.