🧠 AI | Neutral | Importance 7/10

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

arXiv – CS AI | Yibing Liu, Yangze Liu, Xiaolong Yin, Bin Wang, Chong Zhang, Hao Yin, Zhongyi Han
🤖 AI Summary

Researchers introduce OpenClawBench, a large-scale dataset of 31,264 annotated agent execution trajectories that reveals a significant gap between task success and process reliability. The study finds that 9.3% of oracle-passing executions contain process-side anomalies like unresolved ambiguities and unsafe operations, demonstrating that success metrics alone mask critical failure modes in AI agent systems.

Analysis

OpenClawBench addresses a blind spot in AI agent evaluation: the difference between reaching a task outcome and reaching it safely and reliably. Traditional metrics check whether an agent completes a task, but this research shows that agents can pass final checks while accumulating serious process failures, including unresolved ambiguities, ignored errors, unsafe external writes, and overcommitted capabilities. This outcome-process gap matters because real-world deployment requires operational reliability as well as functional correctness.

The research reflects a growing recognition that AI agents operating in complex environments need stronger guardrails. As agents gain access to real tools and external systems, the consequences of a process failure extend beyond an incomplete task. The dataset's scale (31,264 trajectories from multiple source models) enables systematic measurement of failure modes that single-task evaluations would miss. The 2,904 successful executions that still contain anomalies correspond to a 9.3% anomaly rate that success-only metrics hide.
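As a quick check of the arithmetic, the two figures quoted above agree if the 9.3% is computed over all 31,264 trajectories. That denominator is an assumption made here for illustration; the summary does not state it explicitly.

```python
# Figures quoted in the summary.
total_trajectories = 31_264
anomalous_successes = 2_904

# Assumption: the rate is taken over every trajectory in the dataset.
rate = anomalous_successes / total_trajectories

print(f"{rate:.1%}")  # 9.3%
```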

For the AI industry, this framework establishes monitoring capabilities essential for safe agent deployment. The FullTax taxonomy (binary labels, severity ratings, recoverability assessment, localization information) transforms raw execution logs into actionable supervision data. A fine-tuned Gemma 3 12B detector that reaches 0.729 binary F1 demonstrates the feasibility of automated anomaly detection at scale.
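To make the annotation dimensions and the reported metric concrete, here is a minimal sketch of what a per-trajectory label and a binary F1 computation might look like. The class name, field names, severity values, and toy data are all hypothetical and are not taken from the paper; only the four dimensions and the use of binary F1 come from the summary.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class TrajectoryLabel:
    """Hypothetical FullTax-style annotation; field names are illustrative."""
    is_anomalous: bool                  # binary label
    severity: Optional[str] = None      # severity rating (assumed scale)
    recoverable: Optional[bool] = None  # recoverability assessment
    step_index: Optional[int] = None    # localization: step where it occurs


def binary_f1(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Standard F1 for the positive (anomalous) class."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data for illustration only, not drawn from the dataset.
gold = [
    TrajectoryLabel(True, "high", False, 3),
    TrajectoryLabel(False),
    TrajectoryLabel(True, "low", True, 7),
    TrajectoryLabel(False),
]
pred = [True, True, False, False]  # a detector's binary predictions

score = binary_f1([t.is_anomalous for t in gold], pred)
print(score)  # 0.5
```

With one true positive, one false positive, and one false negative, precision and recall are both 0.5, so the F1 is 0.5. The reported 0.729 would be computed the same way over the benchmark's evaluation split.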

Moving forward, integration of process-side monitoring into production agent systems becomes critical. Developers must adopt multi-dimensional evaluation frameworks that track both outcomes and execution patterns. The research positions OpenClawBench as foundational infrastructure for building more reliable autonomous systems, particularly as agents increasingly interface with high-stakes environments requiring operational transparency.

Key Takeaways
  • 9.3% of task-successful agent executions contain process anomalies, revealing a major gap between outcome and execution quality metrics
  • OpenClawBench's FullTax taxonomy enables structured classification of six distinct anomaly types with severity and recoverability assessment
  • Automated anomaly detection using fine-tuned Gemma 3 12B achieves a binary F1 of 0.729, suggesting that scalable runtime monitoring of agent reliability is feasible
  • Process-side evaluation frameworks are essential infrastructure for safe deployment of autonomous agents in real-world systems
  • The 31,264-trajectory dataset establishes measurement standards for agent reliability beyond traditional task-completion metrics