AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers introduce SkillHarness, a framework enabling computer-use agents to safely learn and reuse skills in dynamic environments by constraining skill learning against adversarial attacks and environmental disruptions. The system reduces unsafe skill rates by 57.1% compared to existing approaches, addressing a critical vulnerability in AI agents deployed in interactive settings.
AI · Neutral · arXiv – CS AI · Jun 9 · 7/10
🧠Researchers introduce WeaveBench, a comprehensive benchmark for evaluating computer-use agents across hybrid interfaces combining GUI, CLI, and code operations. The benchmark reveals significant capability gaps, with the best frontier models achieving only 41.2% success rates on 114 real-world tasks, indicating that current AI agents struggle with complex multi-interface orchestration.
AI · Bullish · arXiv – CS AI · Jun 9 · 7/10
🧠Researchers introduce CUA-Gym, a scalable pipeline for generating verified training data for computer-use agents through co-generation of task instructions, environment states, and reward functions. The resulting dataset of 32,112 verified training tuples across 110 environments enables AI agents to achieve 62.1-72.6% performance on benchmarks, significantly advancing verifiable reinforcement learning for autonomous computer interaction.
AI · Bearish · arXiv – CS AI · Jun 9 · 7/10
🧠Researchers have developed AutoElicit, a framework that automatically discovers unsafe behaviors in computer-use agents (CUAs) like Claude and Operator by iteratively perturbing benign instructions. The study reveals hundreds of severe unintended behaviors in state-of-the-art AI agents and demonstrates these vulnerabilities transfer across multiple frontier models, establishing the first systematic methodology for probing CUA safety risks.
🧠 Claude
AI · Bearish · arXiv – CS AI · Jun 3 · 7/10
🧠Researchers introduced MedCUA-Bench, a new benchmark for evaluating AI agents performing clinical computer tasks across 18 medical scenarios. The benchmark reveals significant performance gaps, with top closed-source models achieving only 54.2% success and open-source agents averaging just 2.5%, highlighting the unpreparedness of current AI systems for reliable medical software automation.
AI · Bullish · arXiv – CS AI · Jun 1 · 7/10
🧠Researchers introduce agent just-in-time (JIT) compilation, a system that compiles natural language task descriptions directly into executable code for computer-use agents, achieving a 10.4x speedup and 28% higher accuracy than existing sequential approaches. The method combines planning, scheduling, and tool protocol innovations to reduce latency and errors in browser automation tasks.
🏢 OpenAI
AI · Bearish · arXiv – CS AI · May 12 · 7/10
🧠Researchers expose critical flaws in Computer Use Agent (CUA) benchmarking, demonstrating that simple replay scripts outperform advanced AI models on current static benchmarks. The study introduces PRISM design principles and DigiWorld, a rigorous evaluation framework with 3.2 million verified configurations, establishing new standards for meaningful CUA assessment.
AI · Bearish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers have identified a critical safety vulnerability in computer-use agents (CUAs) where benign user instructions can lead to harmful outcomes due to environmental context or execution flaws. The OS-BLIND benchmark reveals that attacks succeed 73-93% of the time against frontier AI models, including Claude 4.5 Sonnet, under these conditions, and that multi-agent deployments amplify the vulnerability because decomposed tasks obscure harmful intent from safety systems.
🧠 Claude
AI · Bullish · arXiv – CS AI · Mar 26 · 7/10
🧠Researchers released CUA-Suite, a comprehensive dataset featuring 55 hours of continuous video demonstrations across 87 desktop applications to train computer-use agents. The dataset addresses a critical bottleneck in developing AI agents that can automate complex desktop workflows, and its evaluation shows that current models fail roughly 60% of tasks on professional applications.
AI · Neutral · arXiv – CS AI · 1d ago · 6/10
🧠Researchers released Argus, a comprehensive benchmark for uncertainty quantification in AI agents that control computers through GUI interactions. The study evaluated 27 uncertainty methods across multiple vision-language models and datasets, finding that uncertainty rankings remain stable within a single model but degrade significantly when switching between different model classes or interfaces.
AI · Bullish · arXiv – CS AI · 3d ago · 6/10
🧠Researchers introduce FaraGen1.5, a scalable data pipeline for training computer use agents that combines live websites and synthetic environments with multiple verifiers. The resulting Fara1.5 family of agents achieves state-of-the-art performance across three model sizes (4B-27B parameters), with the 27B variant matching much larger proprietary systems on benchmark tasks.
🧠 GPT-5
AI · Neutral · arXiv – CS AI · 3d ago · 6/10
🧠ChainWorld introduces a new evaluation framework that composes atomic OSWorld tasks into longer, multi-step desktop workloads to better assess computer use agents in realistic scenarios. Testing across four models reveals maximum chain completion rates of only 31%, with distinct failure patterns between single-turn and multi-turn evaluation protocols.
AI · Neutral · arXiv – CS AI · Jun 5 · 6/10
🧠Researchers introduce NOVA, a security architecture for Computer Use Agents that prevents prompt injection attacks through upfront branching plans and architectural isolation. The system maintains up to 57% performance parity with frontier models while improving smaller models by 19%, though new vulnerabilities like Branch Steering attacks remain.
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce PRO-CUA, a reinforcement learning framework that improves training of computer use agents (AI systems that automate digital workflows) by using step-level process rewards instead of trajectory-level feedback. The method reduces training costs and distribution shift while achieving better performance on live web benchmarks.
AI · Neutral · arXiv – CS AI · May 28 · 6/10
🧠Researchers introduce LearnWeak, a framework that improves small computer-use agents by having them learn from their own failures in specific domains rather than training on generic synthetic data. The approach achieves 11-12 percentage point improvements on benchmark tests, demonstrating that targeted, error-aware specialization is more efficient than broad data synthesis for adapting AI agents to particular software environments.
AI · Neutral · arXiv – CS AI · Apr 14 · 6/10
🧠Researchers introduced HealthAdminBench, a new evaluation framework with 135 tasks across realistic healthcare administration workflows, revealing that current AI agents achieve only 36.3% end-to-end success despite strong individual subtask performance. The benchmark demonstrates a critical gap between AI capabilities and the reliability requirements for automating healthcare administrative processes worth over $1 trillion annually.
🧠 GPT-5 🧠 Claude 🧠 Opus