AI · Bullish · arXiv – CS AI · May 29 · 7/10
🧠Researchers introduce PokerSkill, a framework that enables large language models to play expert-level poker without training or computational solvers by combining rule-based poker skills with LLM reasoning. The approach achieves competitive performance against state-of-the-art GTO benchmarks, reducing losses by 49-61% compared to standard LLM prompting and outperforming established poker bots.
🧠 GPT-5 🧠 Claude 🧠 Opus
AI · Bearish · arXiv – CS AI · May 4 · 7/10
🧠Researchers have identified critical vulnerabilities in how large language models make strategic decisions under incomplete information, revealing gaps between their internal beliefs and external reasoning. The study demonstrates that LLMs encode more accurate hidden beliefs than they express verbally, but these beliefs are brittle and degrade with multi-hop reasoning, raising significant concerns about deploying LLMs in high-stakes decision-making scenarios without safeguards.
🧠 Llama
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers demonstrate that Reinforcement Learning from Verifiable Rewards (RLVR) can train Large Language Models to negotiate effectively in incomplete-information games like price bargaining. A 30B parameter model trained with this method outperforms frontier models 10x its size and develops sophisticated persuasive strategies while generalizing to unseen negotiation scenarios.
AI · Neutral · arXiv – CS AI · Mar 11 · 7/10
🧠Researchers introduce STAR Benchmark, a new evaluation framework for testing Large Language Models in competitive, real-time environments. The study reveals a strategy-execution gap where reasoning-heavy models excel in turn-based settings but struggle in real-time scenarios due to inference latency.
AI · Neutral · arXiv – CS AI · Jun 9 · 6/10
🧠Researchers demonstrate that symbolic reasoning frameworks (I-Ching, Tarot), injected as prompts into language models deployed as strategic agents, significantly reshape multi-agent game outcomes by modulating risk-aversion behaviors. In a 7-player diplomacy simulation, each framework produces its own distribution of winners, even though the agents do not follow the frameworks' literal content.
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduced Mindgames, a multi-game arena platform for evaluating large language model agents' social and strategic reasoning across four game environments. A 2025 competition cycle tested 944 agents from 76 teams, revealing that top-performing LLMs rely heavily on explicit structural scaffolding and struggle with rule adherence, while some game environments conflate robustness to errors with genuine strategic ability.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers introduce Strat-Reasoner, an RL-based framework that enhances large language models' strategic reasoning in multi-agent game environments by integrating recursive reasoning across all agents and employing centralized evaluation. The approach demonstrates 22.1% average performance improvements, addressing a critical limitation where LLMs struggle with non-stationary multi-agent dynamics.
AI · Neutral · arXiv – CS AI · May 4 · 6/10
🧠Researchers have developed Solly, an AI agent that achieved elite human-level performance in Liar's Poker through self-play reinforcement learning, winning over 50% of hands against top players. This breakthrough extends AI capabilities beyond two-player games to complex multi-player scenarios with imperfect information, demonstrating novel strategic behaviors that resist exploitation by world-class competitors.