MEMSAD: Gradient-Coupled Anomaly Detection for Memory Poisoning in Retrieval-Augmented Agents
Researchers present MEMSAD, a defense mechanism against memory poisoning attacks on retrieval-augmented LLM agents that uses gradient-coupled anomaly detection to identify adversarial perturbations while maintaining retrieval performance. The work formalizes the security vulnerabilities of persistent external memory systems and demonstrates that while composite defenses achieve perfect detection rates on standard attacks, synonym-based attacks remain undetectable by embedding-based approaches.
This paper addresses a critical security gap in LLM agent architectures as they increasingly rely on persistent memory for multi-session context maintenance. The formalization of memory poisoning as a Stackelberg game provides a rigorous framework for understanding adversarial dynamics, and the authors' correction of previous evaluation protocols reveals that attack success rates were significantly underestimated: they roughly quadruple under faithful evaluation conditions.
The MEMSAD defense mechanism represents a meaningful advance in adversarial robustness for language models. Its gradient coupling theorem proves that detection risk and retrieval performance are mathematically linked, yielding a certified detection radius with minimax-optimal sample complexity bounds. This theoretical foundation distinguishes MEMSAD from heuristic-based defenses and provides formal guarantees about detection reliability for any adversary operating within the certified radius.
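The gradient coupling idea can be sketched with a toy detector. This is entirely our own construction under assumed mechanics, not the paper's implementation: the intuition is that an entry an adversary has optimized to dominate retrieval for a target query sits near a stationary point of the retrieval loss, so its loss-gradient norm is anomalously small compared with clean entries.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_norm(query, entry):
    # Analytic gradient of loss(e) = 1 - cos(query, e) w.r.t. the entry
    # embedding e: grad = -(q_hat - (q_hat . e_hat) e_hat) / ||e||,
    # so ||grad|| = sin(angle between query and entry) / ||e||.
    q = query / np.linalg.norm(query)
    e = entry / np.linalg.norm(entry)
    g = -(q - (q @ e) * e) / np.linalg.norm(entry)
    return float(np.linalg.norm(g))

d = 16
query = rng.normal(size=d)
clean_entries = [rng.normal(size=d) for _ in range(50)]
clean_scores = [grad_norm(query, e) for e in clean_entries]
threshold = min(clean_scores)  # low quantile calibrated on clean memory

# An entry optimized to dominate retrieval is nearly parallel to the query,
# i.e. near a stationary point of the loss, so its gradient norm collapses
# far below anything observed on clean data.
poisoned = 2.0 * query + 0.01 * rng.normal(size=d)
poisoned_score = grad_norm(query, poisoned)
assert poisoned_score < threshold
```

The cosine loss, the stationary-point heuristic, and the clean-minimum threshold are all illustrative stand-ins for the paper's calibrated detector.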
For AI developers and companies deploying retrieval-augmented agents in production systems, this research has immediate relevance. The demonstration that composite defenses achieve 100% true positive and 0% false positive rates across diverse attack vectors suggests practical deployment paths, while the identified synonym-substitution loophole indicates a fundamental limitation of embedding-based approaches that warrants architectural consideration.
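The synonym-substitution loophole can be illustrated with a toy model of our own (the word vectors and token lists are hypothetical, not from the paper): because synonyms map to nearby vectors, a meaning-preserving rewrite barely moves the sentence embedding, so no distance threshold that catches a blunt injection will also catch the rewrite.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical word vectors: synonyms share a base vector plus small noise.
base = {w: rng.normal(size=8)
        for w in ["transfer", "funds", "account", "ignore", "previous"]}
vocab = dict(base)
vocab["send"] = base["transfer"] + 0.05 * rng.normal(size=8)  # synonym of "transfer"
vocab["money"] = base["funds"] + 0.05 * rng.normal(size=8)    # synonym of "funds"

def embed(tokens):
    # Toy sentence embedding: mean of word vectors.
    return np.mean([vocab[t] for t in tokens], axis=0)

def shift(a, b):
    # Embedding-space distance an anomaly detector would threshold on.
    return float(np.linalg.norm(embed(a) - embed(b)))

stored = ["transfer", "funds", "account"]
synonym_rewrite = ["send", "money", "account"]      # meaning-preserving attack
crude_injection = ["ignore", "previous", "account"] # blunt injection

# Any threshold separating the crude injection (large shift) from benign noise
# still passes the synonym rewrite (tiny shift).
assert shift(stored, synonym_rewrite) < 0.5 < shift(stored, crude_injection)
```

The gap exists by construction here, but it mirrors the paper's point: the attack moves in discrete token space along directions the continuous embedding is deliberately invariant to.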
Looking forward, this work establishes a research direction for formal security properties of agent memory systems. The discrete synonym-invariance boundary suggests that closing detection gaps may require hybrid approaches combining embedding-based detection with discrete-space defenses, or architectural changes to memory retrieval mechanisms themselves. As agents become more autonomous and maintain longer-lived memories, these security properties will increasingly influence trustworthiness and adoptability in high-stakes applications.
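A minimal sketch of the discrete half of such a hybrid defense (the `CANONICAL` map and token names are hypothetical, and a real system would use a proper synonym resource): canonicalizing tokens before matching pulls a meaning-preserving rewrite back onto the stored form, where an exact or edit-distance check can fire even though the embedding barely moved.

```python
# Toy discrete-space defense: map synonyms to a canonical token before matching.
CANONICAL = {"send": "transfer", "money": "funds", "wire": "transfer"}

def canonicalize(tokens):
    # Replace each token by its canonical form, leaving unknown tokens as-is.
    return [CANONICAL.get(t, t) for t in tokens]

stored = ["transfer", "funds", "account"]
rewrite = ["send", "money", "account"]  # synonym attack that evades embeddings

# After canonicalization the rewrite collapses onto the stored entry, so a
# discrete comparison detects the substitution that embeddings ignored.
assert canonicalize(rewrite) == stored
```

An embedding detector and a check like this fail in complementary ways, which is the motivation for combining modalities.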
- MEMSAD achieves 100% detection accuracy on standard attacks while maintaining full retrieval functionality through gradient-coupled anomaly detection.
- Corrected evaluation protocols show prior memory poisoning attack success rates were underestimated by 4x, indicating stronger threats than previously understood.
- Synonym substitution attacks evade all embedding-based defenses, exposing a fundamental limitation that requires non-continuous defense strategies.
- Minimax-optimal analysis proves any threshold detector requires Ω(1/τ²) calibration samples, and MEMSAD achieves theoretical optimality within logarithmic factors.
- Formal characterization of the discrete-space defense boundary provides a roadmap for hybrid defense architectures combining multiple detection modalities.
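The calibration sample-complexity bound in the last takeaway can be illustrated with a back-of-the-envelope simulation, entirely our own construction (Gaussian clean scores and a 99% quantile threshold are assumptions): empirical quantiles concentrate at rate 1/√n, so calibrating a threshold to within margin τ costs on the order of 1/τ² clean samples.

```python
import numpy as np

rng = np.random.default_rng(2)
Z99 = 2.3263  # true 99% quantile of the standard normal

def calibration_error(n, trials=200):
    # Mean absolute error of the empirical 99% quantile threshold
    # when calibrated from n clean anomaly scores.
    errs = [abs(np.quantile(rng.normal(size=n), 0.99) - Z99)
            for _ in range(trials)]
    return float(np.mean(errs))

# A 16x larger sample budget cuts the calibration error by roughly 4x
# (the 1/sqrt(n) rate), i.e. margin tau costs on the order of 1/tau^2 samples.
e_small, e_large = calibration_error(1_000), calibration_error(16_000)
assert e_large < e_small / 2
```

This only shows the achievability direction (quantile estimators attain the rate); the paper's minimax claim is that no threshold detector can do asymptotically better.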