🧠 AI⚪ NeutralImportance 7/10

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

arXiv – CS AI|Wanli Li, Bowen Zhou, Yunyao Yu, Zhou Xu, Yifan Yang, Dongsheng Li, Caihua Shan|June 9, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce WeaveBench, a comprehensive benchmark for evaluating computer-use agents across hybrid interfaces combining GUI, CLI, and code operations. The benchmark reveals significant capability gaps, with the best frontier models achieving only 41.2% success rates on 114 real-world tasks, indicating that current AI agents struggle with complex multi-interface orchestration.

Analysis

WeaveBench addresses a critical evaluation gap in AI agent development by testing what matters in real-world scenarios: seamless orchestration across multiple interfaces. Traditional benchmarks compartmentalize testing—evaluating GUI control separately from CLI expertise and code editing—but practical agent deployment demands fluid transitions between these modalities within single workflows. This benchmark's design around actual user requests and verifiable artifacts grounds evaluation in authentic use cases rather than synthetic problems.

The 41.2% maximum pass rate signals that despite recent advances in AI agent capabilities, current systems fundamentally struggle with sustained multi-step reasoning across heterogeneous interfaces. The trajectory-aware judge mechanism is particularly valuable, detecting shortcut behaviors like fabricated visual evidence that outcome-only evaluation would miss. This methodological rigor prevents inflated performance claims and reveals true capability gaps.

For the AI agent ecosystem, WeaveBench establishes higher evaluation standards that will shape development priorities. Teams building agent runtimes and model providers will face pressure to improve cross-interface reasoning and long-horizon planning. The benchmark's real Ubuntu desktop environment and published artifacts ensure reproducibility and prevent gaming through domain-specific optimizations.

Future work likely focuses on whether scaling model capacity, improving prompting strategies, or architectural changes to agent runtimes can meaningfully improve performance on such tasks. The substantial gap between current performance and task difficulty suggests that hybrid-interface competence remains a frontier challenge for AI development.

Key Takeaways

→WeaveBench introduces 114 real-world tasks requiring agents to coordinate GUI, CLI, and code operations within single workflows.
→Best frontier model performance reaches only 41.2% success rate, exposing a critical capability gap in current AI agents.
→Trajectory-aware judging mechanism prevents performance overestimation by detecting shortcut behaviors and fabricated evidence.
→Existing benchmarks underestimate evaluation challenges by testing interfaces separately rather than in integrated workflows.
→The benchmark uses real Ubuntu desktops and verifiable artifacts, ensuring reproducible evaluation resistant to gaming.

#ai-agents #benchmarking #computer-use-agents #evaluation-methodology #multi-modal-ai #long-horizon-tasks #frontier-models #agent-capabilities

Read Original →via arXiv – CS AI

Act on this with AI

Stay ahead of the market.

Connect your wallet to an AI agent. It reads balances, proposes swaps and bridges across 15 chains — you keep full control of your keys.

Connect Wallet to AI →How it works

AIMay 6

Your company’s AI could delete everything in 9 seconds. ServiceNow wants to be the kill switch

AIMay 6

Hut 8 (HUT) Stock Soars 37% on Massive $9.8 Billion AI Data Center Agreement

AIMay 6

WeaveBench: A Long-Horizon, Real-World Benchmark for Computer-Use Agents with Hybrid Interfaces

Your company’s AI could delete everything in 9 seconds. ServiceNow wants to be the kill switch

Hut 8 (HUT) Stock Soars 37% on Massive $9.8 Billion AI Data Center Agreement

S&P 500 and NASDAQ hit record highs as AI chip stocks surge