AI · Neutral · arXiv – CS AI · 10h ago · 7/10
🧠
Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents
Researchers evaluated six defense mechanisms against persistent memory attacks on LLM agents and found that most input- and retrieval-level defenses fail to prevent the execution of malicious instructions stored in agent memory. Only Memory Sandbox, a memory-layer tool-gating approach, blocked attacks across eight of nine models at zero utility cost, though it paradoxically increased attack success on one reasoning model by pushing the attack toward alternative execution pathways.
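To illustrate the general idea of memory-layer tool gating, here is a minimal hypothetical sketch: tool calls are tagged with the provenance of the instruction that triggered them, and execution is only permitted when the request originates from the live user turn rather than from persisted agent memory. All names and the provenance-tagging scheme are illustrative assumptions, not the paper's actual Memory Sandbox implementation.

```python
# Hypothetical sketch of memory-layer tool gating. The ToolCall/provenance
# design is an assumption for illustration, not the paper's actual API.
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    provenance: str  # "user_turn" or "memory"


class MemorySandbox:
    """Gate tool execution on where the triggering instruction came from."""

    ALLOWED_PROVENANCE = {"user_turn"}

    def execute(self, call: ToolCall) -> str:
        # Block any tool call whose triggering instruction was loaded
        # from persistent memory instead of the current user turn.
        if call.provenance not in self.ALLOWED_PROVENANCE:
            return f"BLOCKED: {call.name} (requested from {call.provenance})"
        return f"EXECUTED: {call.name}"


sandbox = MemorySandbox()
print(sandbox.execute(ToolCall("send_email", "memory")))     # blocked
print(sandbox.execute(ToolCall("send_email", "user_turn")))  # executed
```

The key design point this sketch captures is that gating happens at the memory/tool boundary rather than at the input or retrieval stage, so a malicious instruction can persist in memory yet still fail to trigger tool execution.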