Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind
Researchers introduce ToM-SB, a challenge in which an AI defender must use theory-of-mind reasoning to deceive an attacker trying to extract sensitive information. Defenders trained with reinforcement learning outperform frontier LLMs such as GPT-4 and Gemini-Pro, and the results reveal an emergent bidirectional relationship between belief modeling and deception capabilities.