Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind
Researchers introduce ToM-SB, a novel challenge in which AI defenders must use theory-of-mind reasoning to deceive attackers attempting to extract sensitive information. Trained via reinforcement learning, these defenders outperform frontier LLMs such as GPT-4 and Gemini-Pro, revealing an emergent bidirectional relationship between belief modeling and deception capability.
This research addresses a critical gap in LLM safety by examining how conversational AI systems can defend against adversarial information extraction through sophisticated social reasoning. The ToM-SB challenge represents a meaningful departure from traditional adversarial ML research by introducing a cooperative deception framework where defenders must anticipate attacker beliefs and exploit those assumptions strategically.
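The core belief-steering idea can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions (all names, the attacker's naive update rule, and the hint-based message space are hypothetical, not the paper's actual setup): the defender maintains a model of the attacker's belief over candidate secrets, simulates how each possible reply would shift that belief, and picks the reply that steers the attacker away from the true secret.

```python
# Toy belief-steering sketch (hypothetical setup, not the ToM-SB protocol).
# The attacker holds a probability distribution over candidate secrets and
# naively boosts whichever secret the defender's reply hints at.

SECRETS = ["alpha", "beta", "gamma"]
TRUE_SECRET = "alpha"

def attacker_update(belief, reply):
    # Attacker doubles the weight of the hinted secret, then renormalizes.
    hinted = reply["hint"]
    new = {s: p * (2.0 if s == hinted else 1.0) for s, p in belief.items()}
    z = sum(new.values())
    return {s: p / z for s, p in new.items()}

def defender_reply(belief):
    # Theory-of-mind step: simulate the attacker's update for each possible
    # hint and choose the one that minimizes belief in the true secret.
    best = min(SECRETS,
               key=lambda h: attacker_update(belief, {"hint": h})[TRUE_SECRET])
    return {"hint": best}

belief = {s: 1.0 / len(SECRETS) for s in SECRETS}
for _ in range(5):
    belief = attacker_update(belief, defender_reply(belief))

# After a few turns, the attacker's probability on the true secret collapses.
```

The design point this illustrates is that the defender's advantage comes entirely from the inner `attacker_update` simulation: without a model of how the attacker revises beliefs, there is nothing to steer.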
The emergent relationship between theory-of-mind capabilities and fooling performance carries significant implications for AI safety architecture. The finding that rewarding either fooling or ToM independently improves both metrics suggests these capacities are fundamentally intertwined—a defender must understand an attacker's mental model to successfully manipulate their conclusions. This challenges assumptions about training objectives being orthogonal and points toward more integrated safety mechanisms.
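One way to see why the two objectives reinforce each other is to write them as terms of a single reward. The sketch below is a hypothetical formulation (the function name, belief representation, and mixing weight are assumptions, not the paper's training objective): a fooling term scoring whether the attacker's final guess misses the secret, plus a ToM term scoring how well the defender predicted the attacker's actual belief state.

```python
# Hypothetical joint reward mixing a fooling term and a ToM-accuracy term.
# Beliefs are dicts mapping candidate secrets to probabilities.

def combined_reward(attacker_guess, true_secret,
                    predicted_belief, actual_belief, w_tom=0.5):
    # Fooling term: 1 if the attacker's final guess misses the true secret.
    fooling = 1.0 if attacker_guess != true_secret else 0.0
    # ToM term: 1 minus the total variation distance between the defender's
    # predicted attacker belief and the attacker's actual belief.
    tvd = 0.5 * sum(abs(predicted_belief[s] - actual_belief[s])
                    for s in actual_belief)
    tom = 1.0 - tvd
    return (1 - w_tom) * fooling + w_tom * tom
```

Under a formulation like this, improving either term tends to improve the other: an accurate belief model makes the steering that earns the fooling term possible, while successful fooling is evidence the belief model was accurate.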
The work demonstrates that frontier models, despite their scale and capabilities, struggle with this nuanced task, particularly in complex scenarios with partial attacker knowledge. This capability gap has direct implications for deploying LLMs in high-stakes defensive roles, from customer service systems handling sensitive data to AI systems protecting proprietary information. Organizations relying on off-the-shelf models may face vulnerabilities in adversarial contexts.
The generalization to out-of-distribution attackers and the scalability of the approach suggest a path toward more robust defensive systems. However, the research also implicitly raises concerns about AI systems becoming better at sophisticated deception, a capability with obvious dual-use potential. Future work will likely explore how these findings inform both defensive alignment and the detection of manipulative AI behavior.
- Frontier LLMs struggle with theory-of-mind based deception against information-extraction attacks, particularly in complex scenarios.
- Training on both fooling and ToM rewards creates bidirectional capability improvements, revealing their fundamental interdependence.
- Reinforcement learning-trained double agents outperform GPT-4 and Gemini-Pro on hard test cases, closing a significant capability gap.
- ToM modeling emerges as a key driver of defensive success, validating belief-state reasoning as central to adversarial interactions.
- The approach generalizes to stronger attackers and out-of-distribution settings, demonstrating practical scalability for real-world deployment.