AI · Neutral · arXiv – CS AI · 15h ago · 6/10
JobBench: Aligning Agent Work With Human Will
Researchers introduce JobBench, an AI agent benchmark that evaluates 36 models on 130 tasks across 35 occupations. The benchmark is built around what humans actually want to delegate rather than around economic value alone. The strongest model, Claude Opus, reaches only 45.9% accuracy, revealing significant gaps in how well current AI agents handle real-world professional workflows.
🧠 Claude