From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors
Researchers reveal a critical vulnerability in LLM agents operating in local workspaces, where attackers can plant hidden prompt injections across multiple steps to gain persistent control. The new ClawTrojan benchmark demonstrates 95.5% attack success rates against GPT-5.4, while a proposed defense mechanism called DASGuard offers runtime protection by tracing and sanitizing potentially malicious control text in sensitive files.
The evolution of large language models from conversational interfaces to autonomous agents capable of file operations and tool integration creates a fundamentally different threat landscape than traditional AI safety concerns. This research exposes a sophisticated attack vector that exploits the temporal and spatial separation between malicious input injection and execution, allowing attackers to bypass defenses designed for single-turn interactions. Rather than deploying an immediately harmful prompt, adversaries can embed trojan instructions within files or tool outputs that agents later consume and execute, making detection significantly harder.
This vulnerability represents a natural escalation in AI security research as agents gain real-world operational capabilities. Previous defenses focused on blocking obviously malicious requests, but this multi-step attack paradigm operates in a gray zone where individual actions appear benign in isolation. An agent reading a file with embedded instructions or storing data for later sessions represents legitimate functionality, yet collectively these operations enable persistent backdoor access.
For developers and organizations deploying agentic systems in production environments, this research carries immediate practical implications. The high success rate against current models suggests that many existing deployments may lack adequate protections against this attack class. DASGuard's approach of tracing control content origin and implementing runtime sanitization provides a defensible strategy, though it requires careful implementation to avoid false positives that would limit agent functionality.
The research underscores that agentic AI systems require fundamentally different security architectures than traditional software, where trust boundaries and state management across sessions demand continuous re-evaluation. Organizations must implement provenance tracking for all inputs and develop comprehensive testing protocols that evaluate multi-step attack sequences rather than isolated prompts.
- βMulti-step trojan attacks in LLM agents achieve 95.5% success rates by exploiting temporal separation between malicious injection and execution.
- βExisting single-step prompt injection defenses fail to detect earlier write operations that plant persistent backdoors in local workspaces.
- βDASGuard mitigates this threat by scanning files for control-like text, tracing its origin, and removing content from untrusted sources.
- βAgentic AI systems require fundamentally different security architectures than traditional chatbots due to persistent state and file system access.
- βOrganizations deploying autonomous agents must implement provenance tracking and multi-step attack evaluation in their security testing protocols.