#agent-limitations News & Analysis

2 articles tagged with #agent-limitations. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Researchers introduce Workflow-GYM, a benchmark for evaluating AI agents on complex, long-horizon professional GUI tasks across specialized software environments. Testing reveals that even state-of-the-art models achieve only a 30% success rate, exposing significant limitations in agent consistency, error handling, and comprehension of domain-specific software.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

HLL: Can Agents Cross Humanity's Last Line of Verification?

Researchers introduce HLL (Humanity's Last Line of Verification), a benchmark testing whether multimodal AI agents can bypass CAPTCHA protections designed to verify human users. Tests of eight frontier models reveal significant brittleness: agent performance varies sharply across CAPTCHA types, degrades under realistic conditions, and fails when solutions must be supported by valid action traces. These failures expose gaps in localization, action calibration, and process consistency.