WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces
Researchers introduce WeaveBench, a benchmark for evaluating computer-use agents across hybrid interfaces that combine GUI, CLI, and code operations. The benchmark reveals significant capability gaps: the best frontier models achieve only a 41.2% success rate on 114 real-world tasks, indicating that current AI agents struggle with complex multi-interface orchestration.
WeaveBench addresses a critical evaluation gap in AI agent development by testing what matters in real-world scenarios: seamless orchestration across multiple interfaces. Traditional benchmarks compartmentalize testing, evaluating GUI control separately from CLI expertise and code editing, but practical agent deployment demands fluid transitions between these modalities within a single workflow. The benchmark's design around actual user requests and verifiable artifacts grounds evaluation in authentic use cases rather than synthetic problems.
The 41.2% maximum pass rate signals that despite recent advances in AI agent capabilities, current systems fundamentally struggle with sustained multi-step reasoning across heterogeneous interfaces. The trajectory-aware judge mechanism is particularly valuable, detecting shortcut behaviors like fabricated visual evidence that outcome-only evaluation would miss. This methodological rigor prevents inflated performance claims and reveals true capability gaps.
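The difference between outcome-only and trajectory-aware judging can be illustrated with a minimal Python sketch. This is not WeaveBench's actual judge: the `Step` and `Run` types, the artifact names, and both check functions are hypothetical, chosen only to show how a run that cites evidence it never produced can pass an outcome check yet fail a trajectory check.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    interface: str  # "gui", "cli", or "code" (hypothetical labels)
    action: str     # free-form description of what the agent did
    produced: set[str] = field(default_factory=set)  # artifacts this step created


@dataclass
class Run:
    steps: list[Step]
    final_artifacts: set[str]   # artifacts present when the run ends
    claimed_evidence: set[str]  # artifacts the agent cites as proof of its work


def outcome_only_pass(run: Run, required: set[str]) -> bool:
    """Pass if every required artifact exists at the end, ignoring how it got there."""
    return required <= run.final_artifacts


def trajectory_aware_pass(run: Run, required: set[str], required_interfaces: set[str]) -> bool:
    """Pass only if the outcome holds and the recorded trajectory supports it."""
    if not outcome_only_pass(run, required):
        return False
    # Every piece of cited evidence must have been produced by some recorded step.
    produced = set().union(*(s.produced for s in run.steps))
    if not run.claimed_evidence <= produced:
        return False
    # The run must actually have used each interface the task calls for.
    used = {s.interface for s in run.steps}
    return required_interfaces <= used


# An honest run: exports data in the GUI, captures a screenshot, post-processes via CLI.
honest = Run(
    steps=[
        Step("gui", "export table from spreadsheet app", {"raw.csv", "chart.png"}),
        Step("cli", "clean data with a shell pipeline", {"report.csv"}),
    ],
    final_artifacts={"raw.csv", "chart.png", "report.csv"},
    claimed_evidence={"chart.png"},
)

# A shortcut run: writes the report directly and cites a screenshot it never captured.
shortcut = Run(
    steps=[Step("cli", "write report.csv by hand", {"report.csv"})],
    final_artifacts={"report.csv"},
    claimed_evidence={"chart.png"},
)

required = {"report.csv"}
interfaces = {"gui", "cli"}
print(outcome_only_pass(shortcut, required))                  # outcome check alone
print(trajectory_aware_pass(shortcut, required, interfaces))  # with trajectory check
print(trajectory_aware_pass(honest, required, interfaces))
```

In this toy setup, the shortcut run satisfies the outcome check because the required file exists, but the trajectory check rejects it because the cited screenshot was never produced and no GUI step was recorded.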
For the AI agent ecosystem, WeaveBench sets a higher evaluation standard that is likely to shape development priorities. Teams building agent runtimes and model providers will face pressure to improve cross-interface reasoning and long-horizon planning. The benchmark's real Ubuntu desktop environment and published artifacts support reproducibility and make it harder to game results through domain-specific optimizations.
Future work will likely focus on whether scaling model capacity, improving prompting strategies, or changing agent runtime architectures can meaningfully improve performance on such tasks. The substantial gap between current performance and reliable task completion suggests that hybrid-interface competence remains a frontier challenge for AI development.
- WeaveBench introduces 114 real-world tasks requiring agents to coordinate GUI, CLI, and code operations within single workflows.
- The best frontier model reaches only a 41.2% success rate, exposing a critical capability gap in current AI agents.
- A trajectory-aware judging mechanism prevents performance overestimation by detecting shortcut behaviors and fabricated evidence.
- Existing benchmarks understate the difficulty of real-world workflows by testing interfaces separately rather than in integrated workflows.
- The benchmark uses real Ubuntu desktops and verifiable artifacts, supporting reproducible evaluation that is resistant to gaming.