y0news
🧠 AI | Neutral | Importance 7/10

Evaluating Interactive Reasoning in Large Language Models: A Hierarchical Benchmark with Executable Games

arXiv – CS AI | Mingyuan Fan, Weiguang Han, Daixin Wang, Cen Chen, Zhiqiang Zhang, Jun Zhou
🤖 AI Summary

Researchers introduced a new benchmark for evaluating large language models' reasoning capabilities through interactive games where LLMs must query hidden environments, integrate observations, and adapt strategies. The framework reveals significant performance gaps among frontier models in both success rates and interaction efficiency, with contextual perturbations causing moderate declines but metacognitive tasks producing much larger performance drops.

Analysis

This research reframes how AI reasoning capabilities are assessed. Rather than treating reasoning as a static problem-solving task, the framework treats it as dynamic evidence acquisition and belief updating, which is closer to how reasoning occurs in practice. The benchmark's 474 executable games, spanning five difficulty levels, provide a testing ground that discriminates between models in ways traditional static benchmarks do not.
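To make the query, observe, and update loop concrete, here is a minimal Python sketch. It is not one of the benchmark's games: `HiddenNumberGame`, `bisection_agent`, and `summarize` are hypothetical names invented for illustration, a simple bisection policy stands in for the LLM, and "mean queries over solved episodes" is only an assumed proxy for interaction efficiency, since the summary does not give the paper's metric definitions.

```python
from dataclasses import dataclass


class HiddenNumberGame:
    """Hypothetical executable game: the environment hides an integer in [low, high]."""

    def __init__(self, secret: int, low: int = 1, high: int = 100):
        if not low <= secret <= high:
            raise ValueError("secret must lie within [low, high]")
        self.low, self.high = low, high
        self._secret = secret  # never exposed to the agent directly

    def query(self, guess: int) -> str:
        """Return the only feedback the agent ever observes."""
        if guess < self._secret:
            return "higher"
        if guess > self._secret:
            return "lower"
        return "correct"


@dataclass
class EpisodeResult:
    solved: bool
    queries_used: int


def bisection_agent(game: HiddenNumberGame, max_queries: int) -> EpisodeResult:
    """Stand-in for an LLM: keeps a belief interval and halves it after each observation."""
    low, high = game.low, game.high
    for turn in range(1, max_queries + 1):
        guess = (low + high) // 2
        feedback = game.query(guess)
        if feedback == "correct":
            return EpisodeResult(True, turn)
        if feedback == "higher":
            low = guess + 1   # belief update: secret lies above the guess
        else:
            high = guess - 1  # belief update: secret lies below the guess
    return EpisodeResult(False, max_queries)


def summarize(results: list[EpisodeResult]) -> tuple[float, float]:
    """Success rate, plus mean queries over solved episodes (assumed efficiency proxy)."""
    solved = [r for r in results if r.solved]
    success_rate = len(solved) / len(results)
    mean_queries = (
        sum(r.queries_used for r in solved) / len(solved) if solved else float("nan")
    )
    return success_rate, mean_queries


if __name__ == "__main__":
    results = [bisection_agent(HiddenNumberGame(s), max_queries=7) for s in (1, 37, 50, 100)]
    print(summarize(results))
```

Two agents with the same success rate can differ widely in how many queries they spend, which is why the sketch reports both numbers separately.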

The findings carry important implications for AI development. The stark differences in interaction efficiency suggest that frontier models vary widely in their ability to pose informative queries and iteratively refine their understanding, a capability that matters for real-world applications from scientific research to customer support. The research also exposes a vulnerability: models show moderate resilience to contextual perturbations but severe degradation on counterfactual reasoning and necessity judgment tasks, indicating that current LLMs struggle with metacognitive processes that humans navigate intuitively.

For the AI industry, this benchmark addresses a growing problem: existing evaluation metrics fail to capture reasoning quality comprehensively. As organizations deploy LLMs in high-stakes domains, understanding not just accuracy but also interaction patterns becomes crucial. The framework provides a quantitative foundation for comparing models' reasoning robustness and adaptability.

Looking forward, this work could influence how AI labs design training procedures and evaluation protocols. The distinction between contextual robustness and metacognitive adaptation suggests where targeted improvements are most needed. Future research may build on these findings to develop prompting strategies or training methods that strengthen model reasoning under uncertainty, particularly in multi-turn interactions where evidence accumulation and strategic questioning become central.

Key Takeaways
  • Interactive reasoning evaluation framework treats LLM reasoning as active evidence acquisition rather than static problem-solving
  • Benchmark of 474 executable games discriminates significantly between frontier models in both success rate and interaction efficiency
  • Models show moderate resistance to contextual perturbations but severe degradation in counterfactual reasoning and necessity judgment
  • Current LLMs demonstrate limited metacognitive adaptation capabilities despite strong overall performance metrics
  • Framework addresses critical gap in AI evaluation by measuring robustness and strategic questioning in multi-turn interactions