
ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

arXiv – CS AI | Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu | June 11, 2026 at 04:00 AM

🤖AI Summary

Researchers introduce ISE (Intent → Simulate → Execute), a three-stage framework for synthesizing training data for OS agents. It produces 43,956 structured intents and 23,132 multi-turn trajectories validated by live execution. Fine-tuning Qwen3-8B on the resulting dataset reaches 37.7% pass@1 on ClawEval, outperforming zero-shot GPT-4o and the larger Qwen3-32B, which suggests that careful synthetic-data design can partly offset limited model scale.

Analysis

The ISE framework addresses a critical bottleneck in agent training: the absence of datasets combining structured user intent, multi-turn task delegation, and grounded tool execution. Most existing OS-agent datasets either lack authentic execution outcomes or fail to capture the complexity of real user workflows. This research proposes a systematic synthesis approach that moves beyond static benchmarks toward dynamic, failure-aware training signals.

The methodology is notable because Stage 3 executes every tool call in an isolated OS environment rather than simulating outcomes. This yields authentic failure-recovery dynamics, a cornerstone of robust agent behavior, that purely synthetic or templated datasets cannot capture. The Vendi Score of 61.57 indicates strong diversity in the intent pool, which should reduce the risk of memorization. The ablation study confirms the contribution of Stage 2 (multi-turn simulation), supporting the view that trajectory complexity, not just breadth, drives the performance gains.
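To make the idea of execution-grounded trajectories concrete, here is a minimal Python sketch of running each step for real and recording its actual outcome, including failures. It is not the authors' implementation: the names `StepResult`, `run_step`, and `run_trajectory` are hypothetical, and a temporary directory stands in for the isolated OS environment, which the source does not describe in detail.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class StepResult:
    """Observed outcome of one executed tool call (hypothetical record type)."""
    command: list[str]
    exit_code: int
    stdout: str
    stderr: str


def run_step(command: list[str], workspace: Path, timeout: float = 10.0) -> StepResult:
    """Execute one command inside the workspace and record what actually happened."""
    proc = subprocess.run(
        command, cwd=workspace, capture_output=True, text=True, timeout=timeout
    )
    return StepResult(command, proc.returncode, proc.stdout, proc.stderr)


def run_trajectory(steps: list[list[str]]) -> list[StepResult]:
    """Run all steps in a fresh temporary directory; failed steps are kept, not dropped.

    Assumption: a temp directory is only a stand-in for real isolation
    (container, VM, or similar), which this sketch does not provide.
    """
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp)
        return [run_step(step, workspace) for step in steps]


py = sys.executable
trajectory = run_trajectory([
    [py, "-c", "print(open('notes.txt').read())"],       # fails: file does not exist yet
    [py, "-c", "open('notes.txt', 'w').write('hello')"],  # recovery: create the file
    [py, "-c", "print(open('notes.txt').read())"],       # retry now succeeds
])
exit_codes = [result.exit_code for result in trajectory]
```

The first step fails because the file is missing, and the recorded error message is the kind of signal a simulated outcome would have to guess at. Keeping the failed step in the trajectory, rather than filtering it out, is what preserves the failure-recovery pattern.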

The practical implication is significant: an 18.4-point improvement (19.3% to 37.7%) on Qwen3-8B suggests that dataset quality can close performance gaps typically associated with 4x model scaling, here Qwen3-8B versus Qwen3-32B. This challenges the assumption that capability gains require proportional parameter increases and suggests that AI companies investing in synthetic data infrastructure may gain a competitive advantage without massive computational overhead.
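The arithmetic behind these figures can be checked in a few lines. The sketch below uses only the two pass@1 values quoted above; the parameter counts are the nominal sizes implied by the model names (8B and 32B), which is an assumption rather than an exact count.

```python
# Reported ClawEval pass@1 for Qwen3-8B, in percent (values from the article).
baseline_pass_at_1 = 19.3
finetuned_pass_at_1 = 37.7

# Absolute gain in percentage points, rounded to avoid float noise.
absolute_gain = round(finetuned_pass_at_1 - baseline_pass_at_1, 1)

# Relative gain: how many times higher the fine-tuned score is.
relative_gain = finetuned_pass_at_1 / baseline_pass_at_1

# Nominal parameter ratio between Qwen3-32B and Qwen3-8B (assumed from names).
param_ratio = 32 / 8

print(f"absolute gain: {absolute_gain} percentage points")
print(f"relative gain: {relative_gain:.2f}x")
print(f"parameter ratio: {param_ratio:.0f}x")
```

The gain is 18.4 percentage points, which is close to a doubling of the baseline score in relative terms.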

The public release of ISETrace and the source code supports reproducibility and competitive benchmarking. Future research should explore whether this data-centric approach generalizes to other agent domains (web, API, code) and how it interacts with emerging reasoning models that may process execution traces differently than current architectures.

Key Takeaways

→ISE's three-stage synthesis produces 43,956 diverse intents and 23,132 trajectories with authentic OS execution outcomes
→Qwen3-8B fine-tuned on ISETrace achieves 37.7% ClawEval pass@1, outperforming zero-shot GPT-4o and the 4x larger Qwen3-32B
→Live execution in isolated OS workspaces generates failure-recovery dynamics absent from simulated datasets
→18-point performance gain demonstrates data quality can partially substitute for model scale in agent training
→Public dataset release enables community benchmarking and accelerates OS-agent research
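For readers unfamiliar with the pass@1 metric cited in the takeaways, the following Python sketch shows the commonly used unbiased pass@k estimator, which reduces to the fraction of successful attempts when k = 1. The source does not state how many attempts per task ClawEval uses or how it aggregates them, so the function names and the per-task averaging here are assumptions for illustration only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n attempts sampled, c of them passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Too few failures to fill k draws, so any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average per-task pass@k over (n, c) pairs, one pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical results: four tasks, one attempt each, one task passed.
example_score = benchmark_pass_at_k([(1, 1), (1, 0), (1, 0), (1, 0)], k=1)
```

With a single attempt per task, pass@1 is simply the share of tasks solved on the first try, so the hypothetical example above scores 0.25.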
