EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents
Researchers introduce EnactToM, a benchmark testing whether AI agents can infer and act on others' beliefs in multi-agent embodied environments. All seven frontier models evaluated score 0% on the hard functional theory of mind tasks, revealing a critical gap in AI reasoning capabilities even though the same models reach roughly 45% accuracy on direct belief questions.
EnactToM addresses a fundamental limitation in current AI evaluation frameworks: most benchmarks test explicit theory of mind through direct questions about beliefs, overlooking the practical ability to infer and act on implicit mental states. This distinction between literal and functional theory of mind represents a qualitative jump in AI reasoning requirements. In real multi-agent collaboration—whether robots coordinating in warehouses, autonomous vehicles sharing roads, or software agents managing distributed systems—actors must navigate partial observability, hidden information, and communication constraints without explicit instruction about others' beliefs.
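To make that distinction concrete, here is a minimal sketch, not the EnactToM API: the world model, agent names, and helpers like `functional_task` are invented for illustration. What it captures is the benchmark's core contrast: a literal probe asks for a partner's belief directly, while a functional task requires noticing a stale belief and acting on it under a communication constraint.

```python
# Hypothetical sketch (not the EnactToM API) contrasting a literal belief
# probe with a functional theory-of-mind task under partial observability
# and a constrained communication channel.
from dataclasses import dataclass, field


@dataclass
class AgentView:
    """What one agent has actually observed, not ground truth."""
    believed_key_pos: tuple


@dataclass
class World:
    key_pos: tuple                          # ground truth, hidden from agents
    views: dict = field(default_factory=dict)


def literal_probe(world: World, target: str) -> tuple:
    """Literal ToM: directly answer 'where does `target` think the key is?'
    Frontier models reach about 45% on probes shaped like this."""
    return world.views[target].believed_key_pos


def functional_task(world: World, actor: str, partner: str) -> str:
    """Functional ToM: no belief question is asked. The actor must notice
    that the partner's belief is stale and spend its one message wisely.
    Frontier models reportedly score 0% on the hard variants."""
    partner_belief = world.views[partner].believed_key_pos
    actor_belief = world.views[actor].believed_key_pos
    if actor_belief == world.key_pos and partner_belief != world.key_pos:
        return f"tell {partner}: key is at {world.key_pos}"
    return "stay silent"  # messaging here would waste the constrained channel


world = World(key_pos=(4, 2))
world.views = {"A": AgentView((4, 2)), "B": AgentView((1, 1))}  # B's view is stale

print(literal_probe(world, target="B"))                # (1, 1)
print(functional_task(world, actor="A", partner="B"))  # tell B: key is at (4, 2)
```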
The zero-percent performance on hard functional tasks across seven frontier models signals that scaling existing architectures doesn't automatically solve epistemic coordination. The manual failure analysis pinpoints concrete breakdowns: agents fail to respect information asymmetries, ignore partner constraints, and misallocate their limited communication budget. This contrasts sharply with the 45% accuracy on literal belief probes, suggesting models can answer explicit questions about beliefs without being able to reason through the collaborative scenarios that turn on those beliefs.
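A failure analysis of this shape is easy to picture as a small taxonomy and tally. The sketch below is hypothetical: the category names paraphrase the breakdowns described above, and the episode annotations are invented, chosen so the epistemic share lands near the reported 93% figure.

```python
# Hypothetical sketch of a failure taxonomy for manual analysis; the
# categories paraphrase the reported breakdowns, the episodes are invented.
from collections import Counter
from enum import Enum


class Failure(Enum):
    INFO_ASYMMETRY = "acted on information the partner could not have seen"
    PARTNER_CONSTRAINT = "proposed a plan the partner cannot execute"
    MISALLOCATED_COMMS = "spent the message budget on facts already shared"
    OTHER = "non-epistemic failure (navigation, parsing, etc.)"


EPISTEMIC = {Failure.INFO_ASYMMETRY, Failure.PARTNER_CONSTRAINT,
             Failure.MISALLOCATED_COMMS}

# Invented annotations: 13 of 14 failures are epistemic, i.e. ~93%.
episodes = ([Failure.INFO_ASYMMETRY] * 5 + [Failure.PARTNER_CONSTRAINT] * 4
            + [Failure.MISALLOCATED_COMMS] * 4 + [Failure.OTHER])

tally = Counter(episodes)
epistemic_share = sum(tally[f] for f in EPISTEMIC) / len(episodes)
print(f"epistemic coordination share: {epistemic_share:.0%}")  # 93%
```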
For the AI industry, this benchmark establishes a meaningful complexity frontier beyond current capabilities. The iterative difficulty scaling creates a persistent evaluation challenge that prevents benchmark gaming. Embodied multi-agent settings more closely mirror real-world deployment contexts than isolated belief questions, making functional ToM a critical capability for commercial AI systems. Organizations developing collaborative agents—whether in robotics, gaming, or autonomous systems—must confront this gap explicitly. The research trajectory suggests that advances require architectural innovations in epistemic modeling and communication reasoning, not merely parameter scaling.
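As a rough illustration of what iterative difficulty scaling can look like as a loop: the knob names (`view_radius`, `message_budget`, `belief_divergences`), the pass threshold, and the toy evaluator below are all assumptions for illustration, not the benchmark's actual mechanism. The design point stands regardless: whenever models clear a tier, regenerate a harder one, so the evaluation never plateaus.

```python
# Hypothetical sketch of iterative difficulty scaling; knob names and the
# pass threshold are assumptions, not EnactToM's actual mechanism.

def harder(params: dict) -> dict:
    """Tighten the knobs that make epistemic coordination hard."""
    return {
        "view_radius": max(1, params["view_radius"] - 1),        # see less
        "message_budget": max(0, params["message_budget"] - 1),  # say less
        "belief_divergences": params["belief_divergences"] + 1,  # track more
    }

def evolve_benchmark(params: dict, evaluate, pass_threshold: float = 0.5) -> dict:
    """Regenerate a harder tier whenever the best model clears the current
    one, so the benchmark cannot plateau or be memorized."""
    while evaluate(params) >= pass_threshold:
        params = harder(params)
    return params  # the current frontier of unsolved tasks

# Toy evaluator: pretend scores fall as observability shrinks.
toy_eval = lambda p: p["view_radius"] / 5

print(evolve_benchmark(
    {"view_radius": 5, "message_budget": 3, "belief_divergences": 1}, toy_eval))
# {'view_radius': 2, 'message_budget': 0, 'belief_divergences': 4}
```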
- Frontier AI models fail completely at hard functional theory of mind tasks despite 45% accuracy on literal belief questions
- EnactToM provides formally verified 3D embodied benchmarks with partial observability and constrained communication
- 93% of task failures trace to epistemic coordination breakdowns, including information withholding and partner constraint violations
- Current AI architectures can answer explicit questions about beliefs but struggle to act optimally on implicit ones
- The benchmark automatically increases difficulty as models improve, preventing evaluation plateau and gaming