ENVS: Environment-Native Verified Search for Long-Horizon GUI Agents
Researchers introduce ENVS (Environment-Native Verified Search), a training approach for GUI agents that discovers verified action trajectories in live desktop environments before policy optimization. The method achieves 30.3 pass@8 on the OSWorld benchmark while using 25-28% fewer GPU-hours than online reinforcement learning baselines, and it remains robust under simulated desktop interruptions.
ENVS addresses a fundamental challenge in developing autonomous GUI agents: efficiently discovering reliable action sequences in complex, high-dimensional desktop environments. Rather than relying on costly VM rollouts and delayed feedback, the approach uses environment-native verification during training to construct high-quality supervision signals. This represents a meaningful shift from traditional online reinforcement learning paradigms toward more efficient search-and-filter architectures that leverage real environmental feedback before optimization.
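The search-and-filter idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the `propose` and `verify` callables, the `Trajectory` type, and the toy action vocabulary are all hypothetical stand-ins. It assumes only that candidate action sequences are sampled and that the environment itself decides which ones count as successful supervision.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Trajectory:
    task_id: str
    actions: List[str]


def search_and_filter(
    task_id: str,
    propose: Callable[[str], List[str]],
    verify: Callable[[str, Sequence[str]], bool],
    num_candidates: int = 8,
) -> List[Trajectory]:
    """Sample candidate action sequences; keep only those the verifier accepts."""
    verified = []
    for _ in range(num_candidates):
        actions = propose(task_id)
        if verify(task_id, actions):
            verified.append(Trajectory(task_id, actions))
    return verified


def make_toy_proposer(seed: int = 0) -> Callable[[str], List[str]]:
    """Hypothetical proposer: picks three random actions from a tiny vocabulary."""
    rng = random.Random(seed)
    vocabulary = ["click:menu", "click:settings", "type:hello", "scroll:down"]

    def propose(task_id: str) -> List[str]:
        return [rng.choice(vocabulary) for _ in range(3)]

    return propose


def toy_verify(task_id: str, actions: Sequence[str]) -> bool:
    """Hypothetical success check standing in for an environment-side verifier."""
    return "click:settings" in actions


# Only verified trajectories would be passed on as supervision for training.
supervision = search_and_filter("open_settings", make_toy_proposer(), toy_verify)
```

In this sketch, rejected candidates are simply discarded. A real pipeline would likely also record why a candidate failed, but the summary does not say how ENVS handles that.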
The introduction of OSWorld-Noisy extends evaluation methodologies beyond static benchmarks, testing agent resilience against realistic desktop perturbations like interruptions and distractions. This reflects growing recognition that production-grade autonomous agents must handle messy, dynamic environments rather than idealized laboratory conditions. The ability to maintain visual reasoning while improving task completion suggests ENVS preserves learned capabilities during robustness training—a critical consideration for agents deployed in real-world settings.
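To make the idea of desktop perturbations concrete, here is a hedged sketch of how interruptions might be injected into an environment. It does not reflect how OSWorld-Noisy is actually built: `ToyDesktop`, `NoisyDesktop`, the popup mechanism, and the `interrupt_prob` parameter are all assumptions made for illustration.

```python
import random
from typing import List


class ToyDesktop:
    """Hypothetical stand-in for a desktop environment (not the OSWorld API)."""

    def __init__(self) -> None:
        self.popup_open = False
        self.log: List[str] = []

    def step(self, action: str) -> str:
        if self.popup_open:
            if action == "dismiss_popup":
                self.popup_open = False
                return "popup dismissed"
            return "blocked by popup"
        self.log.append(action)
        return f"did {action}"


class NoisyDesktop:
    """Wrapper that opens a popup before a step with probability interrupt_prob."""

    def __init__(self, env: ToyDesktop, interrupt_prob: float, seed: int = 0) -> None:
        self.env = env
        self.interrupt_prob = interrupt_prob
        self.rng = random.Random(seed)

    def step(self, action: str) -> str:
        if not self.env.popup_open and self.rng.random() < self.interrupt_prob:
            self.env.popup_open = True
        return self.env.step(action)


def run_with_recovery(env, plan: List[str], max_retries: int = 10) -> int:
    """Execute a plan, dismissing popups and retrying; return actions completed."""
    completed = 0
    for action in plan:
        obs = env.step(action)
        retries = 0
        while obs == "blocked by popup" and retries < max_retries:
            env.step("dismiss_popup")
            obs = env.step(action)
            retries += 1
        if obs == "blocked by popup":
            break
        completed += 1
    return completed


base = ToyDesktop()
done = run_with_recovery(NoisyDesktop(base, interrupt_prob=0.5), ["a", "b", "c"])
```

An agent that cannot recover would stall at the first popup. Comparing its completion count with `done` gives a toy version of a robustness gap.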
The compute savings (25-28% fewer GPU-hours), combined with higher task success, suggest the approach scales favorably. ENVS maintains 27.0 pass@8 using only 30% of its search data, indicating the method extracts more value from that data than baseline approaches. This efficiency matters for organizations deploying autonomous agents, since lower compute requirements reduce infrastructure costs and environmental impact.
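The pass@8 figures can be put in perspective with a short calculation. The sketch below assumes pass@k means the percentage of tasks solved by at least one of k attempts. The summary does not define the metric, so this reading is an assumption, and the toy results dictionary is invented for illustration. The only figures taken from the text are 30.3 and 27.0.

```python
from typing import Dict, List


def pass_at_k(results: Dict[str, List[bool]], k: int) -> float:
    """Percentage of tasks with at least one success among the first k attempts."""
    if not results:
        raise ValueError("results must contain at least one task")
    solved = sum(any(attempts[:k]) for attempts in results.values())
    return 100.0 * solved / len(results)


# Invented per-task outcomes, used only to exercise the function.
toy_results = {
    "task_a": [False, True],
    "task_b": [False, False],
    "task_c": [True, False],
    "task_d": [False, False],
}
toy_pass_at_1 = pass_at_k(toy_results, k=1)
toy_pass_at_2 = pass_at_k(toy_results, k=2)

# Reported figures: 30.3 pass@8 with full search data, 27.0 with 30% of it.
full_score = 30.3
reduced_score = 27.0
retained_pct = round(100.0 * reduced_score / full_score, 1)
```

Under this reading, the reduced-data run retains about 89% of the full score (`retained_pct` is 89.1) while using 30% of the search data.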
Looking forward, the research establishes benchmarking practices for robustness evaluation that the broader agent development community may adopt. The work demonstrates that environment-native verification during training can compete with or exceed online RL methods while consuming fewer resources—potentially accelerating adoption of autonomous GUI agents across enterprise applications.
- ENVS achieves 30.3 pass@8 on OSWorld while using 25-28% fewer GPU-hours than online RL baselines
- The OSWorld-Noisy benchmark introduces systematic evaluation of agent resilience against realistic desktop interruptions
- The search-and-filter approach with verified supervision matches or exceeds online reinforcement learning methods
- ENVS maintains visual reasoning abilities while improving robustness, suggesting it preserves learned capabilities
- The method reaches 27.0 pass@8 with only 30% of the search data, indicating improved data efficiency