AI · Neutral · arXiv – CS AI · 7h ago · 6/10
DeskCraft: Benchmarking Desktop Agents on Professional Workflows and Human-in-the-Loop Collaboration
Researchers introduced DeskCraft, a benchmark for evaluating AI desktop agents on complex, long-horizon professional workflows in creative and engineering software. The study reports large performance gaps: GPT-4 achieved only 31.6% accuracy on standard tasks and 27.6% on interactive tasks that require human collaboration, which points to open challenges in multi-step automation and proactive agent communication.
🧠 GPT-5