🧠 AI | ⚪ Neutral | Importance 6/10

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

arXiv – CS AI | Adam Bawatneh, Sagar Sapkota, Amrit Singh Bedi, Santu Karmaker, Mubarak Shah | May 27, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce OmniToM, a new benchmark for evaluating Theory of Mind capabilities in large language models by requiring explicit modeling of belief structures rather than just final answers. The benchmark reveals that current LLMs struggle with tracking actor-specific beliefs and understanding knowledge access, exposing fundamental limitations in social reasoning despite high performance on traditional end-point question answering tasks.

Analysis

OmniToM addresses a critical gap in how researchers evaluate Theory of Mind in language models. Traditional benchmarks measure only whether models arrive at correct answers to social reasoning questions, creating an illusion of understanding while masking whether models actually construct the mental-state representations necessary for robust reasoning. This new benchmark forces models to explicitly map out what each character believes, knows, and intends throughout a narrative, providing visibility into their reasoning process.

The research emerges from growing recognition that end-point evaluation conceals significant weaknesses in LLM reasoning. As AI systems increasingly interact in social contexts requiring nuanced understanding of human intentions and knowledge states, gaps in Theory of Mind become more consequential. OmniToM labels each belief along seven dimensions (recursive order, truth status, knowledge access, explicitness, content type, mental source, and context), which allows assessment at a much finer grain than a single answer-accuracy score.
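To make the seven-dimensional schema concrete, here is a minimal Python sketch of what one labeled belief proposition might look like. Only the seven dimension names come from the summary above. The class name, the `actor` and `proposition` fields, and every label value are hypothetical placeholders, not the benchmark's actual format or vocabulary. The story is the classic Sally-Anne false-belief scenario, used purely for illustration.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BeliefProposition:
    """One labeled belief. Field values are hypothetical, not OmniToM's real labels."""
    actor: str             # who holds the belief (assumed field)
    proposition: str       # what is believed (assumed field)
    # The seven dimensions named in the article:
    recursive_order: int   # 1 = "A believes X", 2 = "A believes B believes X"
    truth_status: str
    knowledge_access: str
    explicitness: str
    content_type: str
    mental_source: str
    context: str


SCHEMA_DIMENSIONS = (
    "recursive_order", "truth_status", "knowledge_access", "explicitness",
    "content_type", "mental_source", "context",
)


def dimension_labels(belief: BeliefProposition) -> dict:
    """Return only the seven schema labels, dropping actor and proposition."""
    record = asdict(belief)
    return {name: record[name] for name in SCHEMA_DIMENSIONS}


# Sally left the room before the marble was moved, so her belief is outdated.
example = BeliefProposition(
    actor="Sally",
    proposition="The marble is in the basket.",
    recursive_order=1,
    truth_status="false",
    knowledge_access="did_not_observe_move",
    explicitness="implicit",
    content_type="location",
    mental_source="perception",
    context="object_moved_while_absent",
)

labels = dimension_labels(example)
```

Keeping the labels in a flat record like this makes it straightforward to group model outputs by any single dimension later on.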

The benchmark's findings have practical implications for AI development. By showing systematically that models struggle with knowledge-access decisions and with transforming narrative facts into actors' beliefs, OmniToM identifies concrete areas for model improvement. Developers gain diagnostic tools for targeting specific weaknesses rather than optimizing against aggregate metrics. The benchmark's scale (22,343 labeled belief propositions across 895 stories) and its human-calibrated annotation pipeline support its use as a rigorous evaluation standard.
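The contrast between aggregate metrics and dimension-level diagnostics can be sketched in a few lines of Python. The function name, the record layout, and the toy data below are all assumptions made for illustration. None of the numbers are results from the paper, and the paper's actual scoring procedure may differ.

```python
from collections import defaultdict


def accuracy_by_label(records, dimension):
    """Group records by their label on one dimension and compute accuracy per label."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        label = record[dimension]
        totals[label] += 1
        hits[label] += int(record["correct"])
    return {label: hits[label] / totals[label] for label in totals}


# Made-up evaluation records: each says whether the model's belief
# proposition was judged correct, plus a hypothetical knowledge-access label.
toy_records = [
    {"knowledge_access": "observed", "correct": True},
    {"knowledge_access": "observed", "correct": True},
    {"knowledge_access": "observed", "correct": True},
    {"knowledge_access": "not_observed", "correct": False},
    {"knowledge_access": "not_observed", "correct": True},
    {"knowledge_access": "not_observed", "correct": False},
]

# A single aggregate score hides where the errors are concentrated.
overall = sum(r["correct"] for r in toy_records) / len(toy_records)

# The per-label breakdown shows that, in this toy data, the errors all
# fall on beliefs about events the actor did not observe.
breakdown = accuracy_by_label(toy_records, "knowledge_access")
```

In this toy data the aggregate accuracy is two thirds, while the breakdown separates a perfect score on observed events from a one-third score on unobserved ones, which is the kind of targeted signal an aggregate number cannot provide.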

Looking forward, OmniToM will likely become essential infrastructure for training socially competent AI systems. As multimodal and embodied AI systems advance, explicit belief modeling becomes increasingly critical for safety and reliability. Researchers will likely use these explicit representations to develop training techniques that improve social reasoning, potentially reshaping how foundation models are built.

Key Takeaways

→ The OmniToM benchmark requires explicit belief modeling rather than just answer correctness, revealing hidden reasoning gaps in LLMs.
→ Current language models show significant bottlenecks in tracking actor-specific beliefs and understanding knowledge access across narratives.
→ The benchmark's seven-dimensional labeling schema provides granular insight into how models represent complex mental states.
→ Explicit belief representations offer clearer diagnostic information for improving AI social reasoning than traditional end-point evaluation does.
→ The research proposes evaluation standards that could influence how future socially competent AI systems are developed and tested.