AI · Neutral · arXiv – CS AI · 8h ago · 6/10
ChainWorld: Composing Long-Horizon Desktop Workloads from Atomic OSWorld Tasks
ChainWorld is an evaluation framework that composes atomic OSWorld tasks into longer, multi-step desktop workloads to assess computer-use agents in more realistic scenarios. Across the four models tested, the best chain completion rate is only 31%, and failure patterns differ between the single-turn and multi-turn evaluation protocols.