MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs
Researchers introduced Mindgames, a multi-game arena platform for evaluating large language model agents' social and strategic reasoning across four game environments. A 2025 competition cycle tested 944 agents from 76 teams, revealing that top-performing LLMs rely heavily on explicit structural scaffolding and struggle with rule adherence, while some game environments conflate robustness to errors with genuine strategic ability.
Mindgames addresses a critical gap in LLM evaluation methodology by moving beyond static benchmarks to test dynamic, multi-agent reasoning over extended interactions. The platform operationalizes four complementary reasoning demands—belief attribution under hidden information, opponent modeling, cooperative inference with knowledge asymmetries, and sustained deception—that existing evaluations fail to capture. This matters because deployed LLM agents increasingly operate in interactive settings where sustained strategic reasoning directly impacts performance and user outcomes.
The 2025 competition revealed systemic limitations in current LLM capabilities. Despite having access to game rules and interaction histories, agents showed brittle rule adherence and depended heavily on explicit structural scaffolding rather than emergent strategic reasoning. Top performers did not necessarily exhibit superior strategic ability; in some environments they were simply more robust to opponent errors, which suggests that the evaluation may be conflating noise tolerance with competence. This error-survival confound, particularly pronounced in Secret Mafia, is a measurement problem that undermines leaderboard validity.
These findings have immediate implications for developers building multi-agent AI systems. Teams cannot treat leaderboard rankings as reliable signals of genuine strategic competence; evaluation within the target domain becomes essential. The release of 29,571 logged games and the MG-Ref offline tournament protocol provides infrastructure for more rigorous future benchmarking, enabling researchers to decompose performance into strategic-ability and error-tolerance components.
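To make the idea of decomposing performance concrete, the following is a minimal sketch of one way such a split could be computed from game logs. It is not the MG-Ref protocol and does not reflect the schema of the released dataset: the `GameRecord` type, its `opponent_errors` field, and the sample records are all hypothetical, chosen only for illustration. The sketch compares an agent's win rate in games where opponents made no rule violations against its win rate in games where they did.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class GameRecord:
    """Hypothetical log entry; the real dataset schema may differ."""
    agent: str
    won: bool
    opponent_errors: int  # assumed count of invalid opponent moves in the game


def win_rate(records: Iterable[GameRecord]) -> Optional[float]:
    """Fraction of games won, or None if there are no games."""
    games = list(records)
    if not games:
        return None
    return sum(r.won for r in games) / len(games)


def decompose(records: Iterable[GameRecord], agent: str) -> dict:
    """Split one agent's win rate by whether opponents made errors."""
    games = [r for r in records if r.agent == agent]
    clean = [r for r in games if r.opponent_errors == 0]
    noisy = [r for r in games if r.opponent_errors > 0]
    return {
        "overall": win_rate(games),
        "clean": win_rate(clean),           # wins without opponent mistakes
        "error_assisted": win_rate(noisy),  # wins when opponents slipped
    }


# Made-up records for illustration only.
sample = [
    GameRecord("agent_a", won=True, opponent_errors=2),
    GameRecord("agent_a", won=True, opponent_errors=1),
    GameRecord("agent_a", won=False, opponent_errors=0),
    GameRecord("agent_a", won=True, opponent_errors=0),
]
result = decompose(sample, "agent_a")
print(result)
```

A large gap between the `clean` and `error_assisted` rates would hint that an agent's ranking owes more to surviving opponent mistakes than to strategy, although a real analysis would also need to control for opponent strength and sample size.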
Looking ahead, the field must develop evaluation frameworks that isolate strategic reasoning from robustness artifacts. Future work should explore whether agents can develop genuinely adaptive strategies or remain dependent on pre-structured reasoning paths, and how evaluation design choices systematically bias leaderboard outcomes.
- The Mindgames competition tested 944 LLM agents from 76 teams across four strategic games, revealing heavy reliance on explicit scaffolding rather than emergent reasoning
- Current LLM agents exhibit brittle rule adherence, and some game environments conflate error survival with strategic ability, compromising leaderboard validity
- The error-survival confound, particularly pronounced in Secret Mafia, demonstrates how evaluation methodology can systematically mismeasure agent capabilities
- The released dataset of 29,571 games and the MG-Ref protocol enable future benchmarking that decomposes performance into strategic and robustness components
- Developers building multi-agent systems cannot treat leaderboard rankings as reliable signals of genuine strategic competence in target domains