OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
Researchers introduce OpenClawBench, a large-scale dataset of 31,264 annotated agent execution trajectories that reveals a significant gap between task success and process reliability. The study finds that 9.3% of oracle-passing executions contain process-side anomalies such as unresolved ambiguities and unsafe operations, showing that success metrics alone mask critical failure modes in AI agent systems.
OpenClawBench addresses a fundamental blind spot in AI agent evaluation: the distinction between achieving task outcomes and executing them safely and reliably. While traditional evaluation metrics focus on whether agents complete tasks, this research exposes how agents can pass final checks while accumulating serious process failures—unresolved ambiguities, ignored errors, unsafe external writes, and overcommitted capabilities. This outcome-process gap matters because real-world deployment requires not just functional correctness but operational reliability.
The research emerges from growing recognition that AI agents operating in complex environments need stronger guardrails. As agents gain access to real tools and external systems, the consequences of process failures extend well beyond a task simply going unfinished. The dataset's scale (31,264 trajectories from multiple source models) enables systematic measurement of failure modes that single-task evaluations would miss. The 2,904 successful executions that still contain anomalies correspond to the reported 9.3% rate, a share of process failures that success-only metrics would hide entirely.
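To make the outcome-process gap concrete, here is a minimal sketch of how such a rate could be computed from annotated trajectories. The `Trajectory` class and its field names are hypothetical illustrations, not the dataset's actual schema, and the toy records are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Hypothetical fields; the real OpenClawBench schema may differ.
    oracle_pass: bool   # did the final outcome check pass?
    has_anomaly: bool   # was any process-side anomaly annotated?


def masked_anomaly_rate(trajectories):
    """Share of oracle-passing trajectories that still carry an anomaly."""
    passing = [t for t in trajectories if t.oracle_pass]
    if not passing:
        return 0.0
    return sum(t.has_anomaly for t in passing) / len(passing)


# Toy data: four passing runs (one of them anomalous) and one failing run.
toy = [
    Trajectory(oracle_pass=True, has_anomaly=False),
    Trajectory(oracle_pass=True, has_anomaly=True),
    Trajectory(oracle_pass=True, has_anomaly=False),
    Trajectory(oracle_pass=True, has_anomaly=False),
    Trajectory(oracle_pass=False, has_anomaly=True),
]

print(f"{masked_anomaly_rate(toy):.2%}")
```

A success-only metric would report the four passing runs as equivalent; the rate above is the portion of them that a process-side check would flag.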
For the AI industry, this framework provides monitoring capabilities that safe agent deployment depends on. The FullTax taxonomy annotates trajectories with binary labels, severity ratings, a recoverability assessment, and localization information, turning raw execution logs into actionable supervision data. A fine-tuned Gemma 3 12B detector reaching 0.729 binary F1 demonstrates that automated anomaly detection at scale is feasible.
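As a rough illustration of the four FullTax dimensions and of what a binary F1 score measures, the sketch below defines a hypothetical annotation record and scores a toy detector against it. The class name, field names, and severity values are assumptions made for this example; the source does not specify the actual label format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FullTaxLabel:
    # Hypothetical record mirroring the four dimensions named in the text.
    is_anomalous: bool                  # binary label
    severity: Optional[str] = None      # severity rating (values assumed)
    recoverable: Optional[bool] = None  # recoverability assessment
    step_index: Optional[int] = None    # localization within the trajectory


def binary_f1(gold, pred):
    """F1 for the positive (anomalous) class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy gold annotations and detector predictions (made up for illustration).
gold_labels = [
    FullTaxLabel(True, severity="high", recoverable=False, step_index=7),
    FullTaxLabel(False),
    FullTaxLabel(True, severity="low", recoverable=True, step_index=2),
    FullTaxLabel(False),
]
gold = [label.is_anomalous for label in gold_labels]
pred = [True, True, False, False]

print(binary_f1(gold, pred))
```

Only the binary label enters the F1 computation here; severity, recoverability, and localization would need their own metrics.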
Moving forward, integrating process-side monitoring into production agent systems becomes critical. Developers need multi-dimensional evaluation frameworks that track both outcomes and execution patterns. The research positions OpenClawBench as foundational infrastructure for building more reliable autonomous systems, particularly as agents increasingly interface with high-stakes environments that require operational transparency.
- 9.3% of task-successful agent executions contain process anomalies, revealing a major gap between outcome metrics and execution quality
- OpenClawBench's FullTax taxonomy enables structured classification of six distinct anomaly types with severity and recoverability assessment
- Automated anomaly detection using a fine-tuned Gemma 3 12B reaches 0.729 binary F1, making scalable runtime monitoring of agent reliability feasible
- Process-side evaluation frameworks are essential infrastructure for safe deployment of autonomous agents in real-world systems
- The 31,264-trajectory dataset establishes measurement standards for agent reliability beyond traditional task-completion metrics