🧠 AI | ⚪ Neutral | Importance 7/10

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

arXiv – CS AI | Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu | May 27, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce QUACK, an evaluation framework that audits whether AI agents in social deduction games ground their language in what they actually perceived or hallucinate their claims. Tests on three frontier vision-language models (VLMs) show that even top performers hallucinate 15.1% of verifiable spatial claims and make accusations without evidence, exposing gaps in the reliability of agent reasoning.

Analysis

QUACK addresses a fundamental problem in AI evaluation: the gap between high-level performance metrics and actual agent competence. Social deduction games provide controlled environments where agents must perceive, reason, communicate, and coordinate, all skills that real-world AI deployment depends on. Scoring agents by win rate alone masks deeper failures in reasoning consistency. The research matters because it shows that current frontier VLMs produce plausible-sounding language that is disconnected from what they perceived, a serious weakness for any system that must reason truthfully.

The framework evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Together these offer a methodological blueprint for deeper AI auditing. The Statement Verification Pipeline reconstructs ground-truth trajectories from engine logs and automatically flags four failure modes: spatial hallucination, unsupported accusations, deception collapse, and language-action inconsistency. Because the verification is automated, it scales beyond manual inspection and allows failure modes to be detected systematically.
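The spatial-hallucination check described above can be sketched roughly as follows. This is a minimal illustration, not QUACK's actual pipeline: the `LogEntry` and `SpatialClaim` types, the function names, the room names, and the three verdict labels are all assumptions made for the example. It assumes the engine log records each agent's room at each tick and that spatial claims have already been extracted from utterances into structured form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    """One hypothetical engine-log record: where an agent was at a tick."""
    tick: int
    agent: str
    room: str


@dataclass(frozen=True)
class SpatialClaim:
    """A hypothetical structured claim: speaker says subject was in room at tick."""
    speaker: str
    subject: str
    room: str
    tick: int


def build_trajectories(log):
    """Reconstruct ground truth as {agent: {tick: room}} from log entries."""
    trajectories = {}
    for entry in log:
        trajectories.setdefault(entry.agent, {})[entry.tick] = entry.room
    return trajectories


def verify_claim(claim, trajectories):
    """Compare a claim with ground truth; no log record means it cannot be checked."""
    actual = trajectories.get(claim.subject, {}).get(claim.tick)
    if actual is None:
        return "unverifiable"
    return "supported" if actual == claim.room else "hallucinated"


# Illustrative data only (room and agent names are made up).
log = [
    LogEntry(10, "red", "cafeteria"),
    LogEntry(10, "blue", "electrical"),
    LogEntry(11, "red", "storage"),
]
claims = [
    SpatialClaim("blue", "red", "cafeteria", 10),   # matches the log
    SpatialClaim("blue", "red", "electrical", 11),  # contradicts the log
    SpatialClaim("blue", "green", "medbay", 10),    # no record for "green"
]
trajectories = build_trajectories(log)
verdicts = [verify_claim(c, trajectories) for c in claims]
print(verdicts)
```

Keeping "unverifiable" separate from "hallucinated" matters here: a claim the log cannot confirm or refute should not count against the agent.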

The empirical findings carry significant implications for AI safety and reliability. If frontier models hallucinate 15.1% of verifiable spatial claims in a controlled environment, similar failure rates are plausible in other perception-grounding tasks, although the study measures only this setting. The finding that agents make accusations without evidence points to a reasoning breakdown distinct from raw hallucination: agents construct narratives that are inconsistent with their own observations.
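To make the two measurements in this paragraph concrete, here is a small sketch of how a hallucination rate over verifiable claims and an unsupported-accusation check might be computed. The function names, the verdict labels, and the `(observer, subject, event)` observation format are assumptions for illustration, and the counts are made-up numbers, not results from the paper.

```python
from collections import Counter


def hallucination_rate(verdicts):
    """Share of verifiable claims that contradict ground truth.

    Unverifiable claims are excluded from the denominator.
    Returns None when no claim could be verified.
    """
    counts = Counter(verdicts)
    verifiable = counts["supported"] + counts["hallucinated"]
    if verifiable == 0:
        return None
    return counts["hallucinated"] / verifiable


def is_unsupported_accusation(accuser, accused, observations):
    """True if the accuser never observed any event involving the accused.

    observations: iterable of hypothetical (observer, subject, event) tuples.
    """
    return not any(
        observer == accuser and subject == accused
        for observer, subject, _event in observations
    )


# Illustrative counts only: 17 supported, 3 hallucinated, 5 unverifiable.
sample = ["supported"] * 17 + ["hallucinated"] * 3 + ["unverifiable"] * 5
rate = hallucination_rate(sample)
print(rate)

observations = [("blue", "red", "seen_near_body")]
print(is_unsupported_accusation("blue", "red", observations))    # has evidence
print(is_unsupported_accusation("blue", "green", observations))  # no evidence
```

Restricting the denominator to verifiable claims mirrors the summary's wording ("verifiable spatial claims"). Whether the paper treats unverifiable claims the same way is not stated here.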

For AI development, these results suggest that scaling models alone is unlikely to resolve grounding problems. Future work should investigate why agents decouple language from perception and whether architectural changes or training approaches can improve consistency. The open-source release enables broader research into the reliability of multimodal reasoning.

Key Takeaways

→ Frontier VLMs hallucinate 15.1% of verifiable spatial claims despite strong overall performance metrics
→ Agents frequently make accusations without grounded evidence, indicating reasoning failures beyond simple hallucination
→ Traditional win-rate metrics mask underlying language-grounding failures that could compromise real-world AI systems
→ Automated statement verification pipelines can systematically audit agent consistency across perception, reasoning, and communication
→ Open-source release enables broader multimodal AI safety research and evaluation methodology development

#ai-evaluation #vision-language-models #hallucination-detection #agent-reasoning #multimodal-ai #grounding-consistency #ai-safety #language-perception
