QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents
Researchers introduce QUACK, an evaluation framework for auditing whether AI agents in social deduction games ground their language in what they actually perceive or hallucinate claims. Testing three frontier vision-language models (VLMs) reveals that even top performers hallucinate 15.1% of verifiable spatial claims and make accusations without evidence, exposing critical gaps in agent reasoning reliability.
QUACK addresses a fundamental problem in AI evaluation: the gap between high-level performance metrics and actual agent competence. Social deduction games provide controlled environments where agents must perceive, reason, communicate, and coordinate—skills essential for real-world AI deployment. Traditional scoring by win rates masks deeper failures in reasoning consistency. This research matters because it demonstrates that current frontier VLMs produce plausible-sounding language disconnected from grounded understanding, a critical vulnerability for any system requiring truthful reasoning.
The framework's three-tier evaluation model—game outcomes, behavioral trajectories, and utterance-level consistency—establishes a methodological blueprint for deeper AI auditing. The Statement Verification Pipeline automatically flags spatial hallucination, unsupported accusations, deception collapse, and language-action inconsistency by reconstructing ground-truth trajectories from engine logs. This automated verification approach scales beyond manual inspection, enabling systematic detection of failure modes.
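To make the verification idea concrete, here is a minimal sketch of how a spatial claim could be checked against trajectories rebuilt from engine logs. It is not the QUACK implementation: the `LogEvent` and `SpatialClaim` record shapes, the room names, and the rule that a claim with no matching log entry counts as unverifiable are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical record shapes; the real engine log format is not described here.
@dataclass(frozen=True)
class LogEvent:
    tick: int
    agent: str
    room: str

@dataclass(frozen=True)
class SpatialClaim:
    speaker: str
    subject: str  # the agent the claim is about
    room: str
    tick: int

def reconstruct_trajectories(events):
    """Build an {agent: {tick: room}} table from engine log events."""
    trajectories = {}
    for ev in events:
        trajectories.setdefault(ev.agent, {})[ev.tick] = ev.room
    return trajectories

def verify_claim(claim, trajectories):
    """Return True or False for a checkable claim, None if the log has no record."""
    actual = trajectories.get(claim.subject, {}).get(claim.tick)
    if actual is None:
        return None
    return actual == claim.room

def hallucination_rate(claims, trajectories):
    """Share of verifiable claims that contradict the reconstructed trajectories."""
    verdicts = [verify_claim(c, trajectories) for c in claims]
    verifiable = [v for v in verdicts if v is not None]
    if not verifiable:
        return 0.0
    return sum(1 for v in verifiable if not v) / len(verifiable)

# Toy data, invented for this example.
events = [
    LogEvent(1, "red", "cafeteria"),
    LogEvent(1, "blue", "engine"),
    LogEvent(2, "red", "hallway"),
    LogEvent(2, "blue", "engine"),
]
claims = [
    SpatialClaim("red", "blue", "engine", 1),      # matches the log
    SpatialClaim("blue", "red", "engine", 2),      # contradicts the log
    SpatialClaim("red", "green", "cafeteria", 1),  # no log record for "green"
]
trajectories = reconstruct_trajectories(events)
rate = hallucination_rate(claims, trajectories)
```

Excluding unverifiable claims from the denominator mirrors the article's phrasing of a rate over "verifiable spatial claims"; a different denominator would give a different number.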
The empirical findings carry significant implications for AI safety and reliability. If frontier models hallucinate 15.1% of verifiable spatial claims in controlled environments, similar failure rates are plausible in other perception-grounding tasks. The discovery that agents make accusations without evidence reveals a reasoning breakdown distinct from raw hallucination: agents construct narratives inconsistent with their own observations.
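The difference between a hallucinated claim and an unsupported accusation can be sketched in a few lines. The example below assumes a deliberately simple rule, which is an illustration and not the paper's definition: an accusation counts as supported only if the accuser's own observation log contains a suspicious sighting of the accused at or before the time of the accusation. The `Observation` and `Accusation` types are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical record shapes, invented for this example.
@dataclass(frozen=True)
class Observation:
    tick: int
    observer: str
    observed: str
    suspicious: bool

@dataclass(frozen=True)
class Accusation:
    tick: int
    accuser: str
    accused: str

def is_supported(accusation, observations):
    """True if the accuser saw the accused acting suspiciously no later than the accusation."""
    return any(
        o.observer == accusation.accuser
        and o.observed == accusation.accused
        and o.suspicious
        and o.tick <= accusation.tick
        for o in observations
    )

def unsupported_accusations(accusations, observations):
    """Return the accusations that lack any backing observation."""
    return [a for a in accusations if not is_supported(a, observations)]

# Toy data: red saw blue act suspiciously at tick 3 and only saw green walk by.
observations = [
    Observation(3, "red", "blue", True),
    Observation(4, "red", "green", False),
]
accusations = [
    Accusation(5, "red", "blue"),   # backed by the tick-3 sighting
    Accusation(5, "red", "green"),  # sighting exists but was not suspicious
    Accusation(2, "red", "blue"),   # made before the sighting happened
]
flagged = unsupported_accusations(accusations, observations)
```

Under this rule an accusation can be flagged even when every individual statement in it is true, which is what separates this failure mode from spatial hallucination.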
For AI development, these results suggest that scaling models alone may not resolve grounding problems. Future work should focus on a mechanistic understanding of why agents decouple language from perception and on whether architectural changes or training approaches can improve consistency. The open-source release enables broader research into multimodal reasoning reliability.
- Frontier VLMs hallucinate 15.1% of verifiable spatial claims despite strong overall performance metrics
- Agents frequently make accusations without grounded evidence, indicating reasoning failures beyond simple hallucination
- Traditional win-rate metrics mask underlying language-grounding failures that could compromise real-world AI systems
- Automated statement verification pipelines can systematically audit agent consistency across perception, reasoning, and communication
- Open-source release enables broader multimodal AI safety research and evaluation methodology development