Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents
Researchers evaluated six defense mechanisms against persistent memory attacks on LLM agents, finding that most input and retrieval-level defenses fail to prevent malicious instruction execution stored in agent memory. Only Memory Sandbox, a memory-layer tool-gating approach, effectively blocked attacks across eight of nine models while maintaining zero utility cost, though it paradoxically increased attack success in one reasoning model by forcing reliance on alternative execution pathways.
This research addresses a critical vulnerability in stateful LLM agents where malicious instructions injected through RAG-retrieved documents persist across sessions and execute in later interactions. The systematic evaluation of defense mechanisms reveals fundamental architectural constraints: input-level filtering cannot detect threats embedded in external documents, while retrieval-level classifiers fall victim to semantic masking techniques that frame malicious requests as compliant queries. These findings highlight why defensive strategies must operate at the appropriate system layer to be effective.
The research emerges as LLM agents increasingly rely on persistent memory and external knowledge retrieval systems for enhanced functionality. As enterprises deploy agents in production environments, the attack surface has expanded significantly beyond single-session prompt injection, making this vulnerability particularly relevant for organizations handling sensitive operations. The 5,040 experimental runs across nine models provide statistical rigor that previous anecdotal observations lacked.
Memory Sandbox's effectiveness demonstrates that architectural redesign—specifically restricting recall capabilities—can eliminate attack vectors entirely. However, the counterintuitive failure case where a reasoning model's existing refusal mechanism is bypassed under the defense reveals that security solutions must account for model-specific behaviors and alternative execution paths. For developers and security teams, these results inform resource allocation: defending at the memory layer proves more effective than input or retrieval filtering, though implementation requires careful testing across different model architectures to avoid unintended capability degradation.
The zero utility cost finding is particularly significant, as it suggests Memory Sandbox doesn't require trading security for functionality. Ongoing work should focus on understanding why certain models exhibit inverted behavior and developing hybrid approaches that maintain defense effectiveness across heterogeneous agent deployments.
- →Input-level and retrieval-level defenses achieve 88-89% attack success rates, statistically equivalent to undefended baseline performance against persistent memory attacks.
- →Memory-layer tool-gating (Memory Sandbox) reduces attack success to 0% for eight of nine models by removing the recall capability that persistent memory attacks require.
- →Semantic masking and compliance-framing techniques defeat retrieval-level classifiers designed to filter malicious instructions from RAG-retrieved documents.
- →One reasoning model inverted under Memory Sandbox defense, achieving 100% attack success by forcing reliance on RAG pathways where refusal mechanisms do not activate.
- →Memory Sandbox imposes no utility cost in non-attack scenarios, making it a practical defense strategy for production LLM agent deployments.