
EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings

arXiv – CS AI | Shiva Krishna Reddy Malay, Shravan Nayak, Jishnu Sethumadhavan Nair, Sagar Davasam, Aman Tiwari, Sathwik Tejaswi Madhusudhan, Sridhar Krishna Nemala, Srinivas Sunkara, Sai Rajeswar
🤖AI Summary

Researchers introduced EnterpriseOps-Gym, a new benchmark for evaluating AI agents in enterprise environments. Even the top-performing model, Claude Opus 4.5, reaches a success rate of only 37.4%. The study highlights critical limitations that stand in the way of autonomous enterprise deployment, particularly in strategic reasoning and in judging whether a task is feasible at all.

Key Takeaways
  • EnterpriseOps-Gym benchmark tests AI agents on 1,150 tasks across eight enterprise verticals using 164 database tables and 512 tools.
  • The top-performing model, Claude Opus 4.5, achieved a success rate of only 37.4% on enterprise planning tasks.
  • Providing oracle human plans improved performance by 14-35 percentage points, which points to strategic reasoning as the main bottleneck.
  • AI agents frequently fail to refuse infeasible tasks, with the best model achieving only 53.9% refusal accuracy.
  • Current AI agents are not ready for autonomous enterprise deployment due to reliability and safety concerns.
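
To make the headline numbers concrete, here is a minimal Python sketch of how a success rate, a refusal accuracy, and a percentage-point gain could be computed. Everything in it is an assumption for illustration: the `TaskResult` fields, the choice to score success over feasible tasks and refusal over infeasible ones, and the toy data. The benchmark's actual scoring may differ.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical record layout, not the benchmark's own schema.
    feasible: bool   # whether the task can be completed at all
    completed: bool  # whether the agent reached the expected end state
    refused: bool    # whether the agent declined the task


def success_rate(results):
    """Percentage of feasible tasks the agent completed (assumed definition)."""
    feasible = [r for r in results if r.feasible]
    return 100.0 * sum(r.completed for r in feasible) / len(feasible)


def refusal_accuracy(results):
    """Percentage of infeasible tasks the agent refused (assumed definition)."""
    infeasible = [r for r in results if not r.feasible]
    return 100.0 * sum(r.refused for r in infeasible) / len(infeasible)


# Toy data only: four feasible tasks and two infeasible ones.
baseline = [
    TaskResult(feasible=True, completed=True, refused=False),
    TaskResult(feasible=True, completed=False, refused=False),
    TaskResult(feasible=True, completed=False, refused=False),
    TaskResult(feasible=True, completed=False, refused=True),
    TaskResult(feasible=False, completed=False, refused=True),
    TaskResult(feasible=False, completed=False, refused=False),
]

# Same tasks, imagining the agent was handed a human-written plan.
with_plan = [
    TaskResult(feasible=True, completed=True, refused=False),
    TaskResult(feasible=True, completed=True, refused=False),
    TaskResult(feasible=True, completed=False, refused=False),
    TaskResult(feasible=True, completed=False, refused=False),
    TaskResult(feasible=False, completed=False, refused=True),
    TaskResult(feasible=False, completed=False, refused=False),
]

# A percentage-point gain is the plain difference between two rates.
gain_points = success_rate(with_plan) - success_rate(baseline)

print(f"baseline success: {success_rate(baseline):.1f}%")
print(f"refusal accuracy: {refusal_accuracy(baseline):.1f}%")
print(f"gain with plan:   {gain_points:.1f} percentage points")
```

Note that a gain quoted in percentage points is an absolute difference between two rates, not a relative improvement.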