
ChainWorld: Composing Long-Horizon Desktop Workloads from Atomic OSWorld Tasks

arXiv – CS AI | Vincent Siu, Manasi Sharma, Dawn Song, Daniel Yue Zhang, Chenguang Wang | June 23, 2026 at 04:00 AM

🤖AI Summary

ChainWorld introduces a new evaluation framework that composes atomic OSWorld tasks into longer, multi-step desktop workloads to better assess computer use agents in realistic scenarios. Testing across four models reveals maximum chain completion rates of only 31%, with distinct failure patterns between single-turn and multi-turn evaluation protocols.

Analysis

ChainWorld addresses a gap in how AI agents are evaluated. Current benchmarks test computer-use agents on isolated, atomic tasks, but real desktop work requires carrying context and state across a sequence of objectives. The researchers compose 347 task chains of two to four steps from atomic OSWorld tasks using a directional compatibility search, producing workloads that more closely reflect how users actually work at a computer.
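The summary does not spell out how the directional compatibility search works, so the following is only a minimal sketch of one plausible reading: task B may follow task A when the state A leaves behind covers what B needs, and the reverse order is checked separately. The `AtomicTask` fields, the `can_follow` rule, and the three demo tasks are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AtomicTask:
    name: str
    requires: frozenset  # hypothetical: state the task needs before it starts
    produces: frozenset  # hypothetical: state the task leaves behind


def can_follow(prev: AtomicTask, nxt: AtomicTask) -> bool:
    """Directional check: nxt may run after prev, which says nothing about prev after nxt."""
    return nxt.requires <= prev.produces


def compose_chains(tasks, min_len=2, max_len=4):
    """Enumerate every ordered chain of compatible tasks with min_len..max_len steps."""
    chains = []

    def extend(chain):
        if len(chain) >= min_len:
            chains.append(tuple(t.name for t in chain))
        if len(chain) == max_len:
            return
        for t in tasks:
            if t not in chain and can_follow(chain[-1], t):
                extend(chain + [t])

    for t in tasks:
        extend([t])
    return chains


# Illustrative tasks only; these are not OSWorld task definitions.
TASKS = [
    AtomicTask("download_report", frozenset({"desktop"}), frozenset({"report.csv"})),
    AtomicTask("make_chart", frozenset({"report.csv"}), frozenset({"chart.png"})),
    AtomicTask("email_chart", frozenset({"chart.png"}), frozenset({"sent"})),
]

for chain in compose_chains(TASKS):
    print(" -> ".join(chain))
```

In this toy setup the chart task can follow the download task but not the other way around, which is the asymmetry the word "directional" suggests. The bounds of two and four steps match the chain lengths reported for the benchmark.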

The evaluation framework employs two distinct protocols. Single-turn evaluation presents all tasks in one prompt, testing the agent's ability to parse complex, multi-part instructions upfront. Multi-turn evaluation reveals tasks sequentially, better mimicking how humans discover objectives dynamically during a work session. This dual approach provides richer diagnostic data than traditional benchmarks.
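The difference between the two protocols comes down to how the chain's instructions reach the agent. A minimal sketch, assuming each task is a plain instruction string; the function names and prompt wording are illustrative and are not ChainWorld's actual prompts:

```python
def single_turn_prompts(chain):
    """Single-turn: one prompt that lists every task in the chain up front."""
    numbered = [f"{i}. {task}" for i, task in enumerate(chain, start=1)]
    return ["Complete the following tasks in order:\n" + "\n".join(numbered)]


def multi_turn_prompts(chain):
    """Multi-turn: one prompt per task, each sent only after the previous turn ends."""
    return list(chain)


# Hypothetical three-step chain.
chain = [
    "Download the quarterly report.",
    "Create a bar chart from the report.",
    "Email the chart to the team.",
]

print(len(single_turn_prompts(chain)))  # 1 prompt carrying all three tasks
print(len(multi_turn_prompts(chain)))   # 3 prompts, one per turn
```

In both cases the desktop session persists across the whole chain; only the timing of when the agent learns each objective changes.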

The results are sobering for the field. Across all tested models, the maximum chain completion rate is only 31%, indicating substantial room for improvement in current agent capabilities. More importantly, the failure analysis shows that breakdown patterns depend on the protocol. Single-turn failures concentrate on artifact precision: agents struggle to meet exact task specifications when handling multiple objectives at once. Multi-turn failures expose session management weaknesses, including fragmented progress and disengagement in later turns.
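To make the headline metric concrete, here is a minimal sketch of a chain completion rate, assuming a chain counts as completed only when every task in it passes. That all-or-nothing scoring rule is an assumption rather than the paper's stated definition, and the result data is invented for illustration.

```python
def chain_completion_rate(results):
    """Fraction of chains in which every step passed (assumed scoring rule)."""
    return sum(all(steps) for steps in results) / len(results)


def first_failed_step(steps):
    """1-based position of the first failing step, or None if the chain completed."""
    return next((i for i, ok in enumerate(steps, start=1) if not ok), None)


# Made-up per-step outcomes for four chains; not data from the paper.
results = [
    [True, True],
    [True, False, True],
    [True, True, True, False],
    [True, True, True],
]

print(chain_completion_rate(results))             # 0.5
print([first_failed_step(s) for s in results])    # [None, 2, 4, None]
```

Recording where each chain first fails is one simple way to separate early breakdowns from the late-turn disengagement described above.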

These findings matter for both AI developers and researchers. The distinct failure profiles suggest that agents need different architectural or training improvements depending on how tasks are presented to them. For the broader AI ecosystem, ChainWorld sets a more rigorous evaluation standard that could accelerate the development of genuinely capable desktop agents. The methodology may also influence how future agent benchmarks are designed, pushing the field toward more realistic task compositions.

Key Takeaways

→ChainWorld shows that computer-use agents reach a maximum chain completion rate of only 31% on multi-step desktop workloads, significantly lower than single-task performance.
→Single-turn evaluation exposes artifact precision failures while multi-turn evaluation reveals session management weaknesses like disengagement.
→The benchmark contains 347 task chains ranging from two to four steps, creating more realistic evaluation scenarios than existing atomic task frameworks.
→Different evaluation protocols expose different failure modes, suggesting agents need targeted improvements for sequential task handling.
→Current computer use agents lack robust state persistence mechanisms needed for realistic long-horizon desktop work.