AI · Bullish · arXiv – CS AI · Apr 13 · 7/10
🧠Researchers introduce a hybrid framework combining probabilistic models with large language models to improve social reasoning in AI agents. It achieves a 67% win rate against human players in the game Avalon, an advance in AI's ability to infer beliefs and intentions from incomplete information.
AI · Neutral · arXiv – CS AI · May 27 · 6/10
🧠Researchers introduce OmniToM, a new benchmark for evaluating Theory of Mind capabilities in large language models by requiring explicit modeling of belief structures rather than just final answers. The benchmark reveals that current LLMs struggle with tracking actor-specific beliefs and understanding knowledge access, exposing fundamental limitations in social reasoning despite high performance on traditional end-point question answering tasks.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduce SocialGrid, a benchmark environment for evaluating Large Language Models as autonomous agents in multi-agent social scenarios. The study reveals that even the most capable open-source LLMs achieve below 60% task completion and struggle significantly with social reasoning tasks like detecting deception, exposing critical limitations in current AI agent capabilities.
AI · Neutral · arXiv – CS AI · Apr 20 · 6/10
🧠Researchers introduced RoleConflictBench, a benchmark dataset containing over 13,000 scenarios across 65 social roles designed to test whether large language models prioritize contextual cues or learned preferences when facing conflicting role expectations. Analysis of 10 leading LLMs revealed that models predominantly rely on ingrained role preferences rather than responding dynamically to situational urgency, indicating a significant gap in contextual sensitivity.
AI · Bullish · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduce CoSToM, a framework that uses causal tracing and activation steering to improve Theory of Mind alignment in large language models. The work addresses a critical gap between LLMs' internal knowledge and external behavior, demonstrating that targeted interventions in specific neural layers can enhance social reasoning capabilities and dialogue quality.
AI · Bullish · arXiv – CS AI · Mar 11 · 6/10
🧠Researchers introduce Social-R1, a reinforcement learning framework that enhances social reasoning in large language models by training on adversarial examples. The approach enables a 4B parameter model to outperform larger models across eight benchmarks by supervising the entire reasoning process rather than just outcomes.
AI · Bearish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers introduced SimpleToM, a benchmark revealing that state-of-the-art language models can infer mental states but struggle to apply that knowledge for behavior prediction and judgment. The study exposes a critical gap between explicit Theory of Mind inference and implicit application in real-world scenarios.
AI · Neutral · arXiv – CS AI · Apr 6 · 4/10
🧠Research reveals that large language models can reproduce the qualitative structure of human social reasoning but struggle to calibrate its quantitative magnitude. Pragmatic prompting strategies that consider speaker knowledge and motives can improve this calibration, though fine-grained accuracy remains only partially achieved.
AI · Neutral · arXiv – CS AI · Mar 5 · 4/10
🧠Researchers evaluated five Multimodal Large Language Models (MLLMs) on their ability to reason about social norms in both text and image scenarios. GPT-4o performed best overall, while all models performed better on text-based norm reasoning than on image-based scenarios.