arXiv – CS AI · 7h ago
🧠
AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents
Researchers introduced AgentEscapeBench, a benchmark that evaluates how well LLM-based agents reason through complex, multi-step tasks requiring external tool use and long-range dependency tracking. Evaluating 16 LLM agents on 270 escape-room-style problems revealed significant performance degradation as task complexity increased: the best models' success rate fell from 90% to 60% as dependency depth tripled, highlighting a critical limitation of current agent capabilities.
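To make the depth-versus-success finding concrete, here is a minimal sketch of how per-problem outcomes could be aggregated into a success rate per dependency depth. The record fields (`model`, `dependency_depth`, `solved`) and the `success_by_depth` helper are hypothetical illustrations; the announcement does not describe the benchmark's actual data format or evaluation harness.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-(model, problem) outcome records; the real
# AgentEscapeBench result schema is not specified in the announcement.
results = [
    {"model": "model-a", "dependency_depth": 2, "solved": True},
    {"model": "model-a", "dependency_depth": 4, "solved": True},
    {"model": "model-a", "dependency_depth": 6, "solved": False},
    # ... one record per problem attempt
]

def success_by_depth(records):
    """Group outcomes by dependency depth and compute each bucket's success rate."""
    buckets = defaultdict(list)
    for r in records:
        buckets[r["dependency_depth"]].append(r["solved"])
    # mean() over booleans yields the fraction solved (True counts as 1).
    return {depth: mean(outcomes) for depth, outcomes in sorted(buckets.items())}

print(success_by_depth(results))
# e.g. {2: 1.0, 4: 1.0, 6: 0.0}
```

A curve like the paper's reported drop (roughly 0.9 at shallow depths falling to 0.6 when depth triples) would appear as declining values across the returned buckets.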