AI · Bullish · arXiv – CS AI · 6h ago · 7/10
STAGE-Claw: Automated State-based Agent Benchmarking for Realistic Scenarios
Researchers introduce STAGE-Claw, an automated framework for evaluating AI agents in realistic personal-computing environments by measuring actual system state changes rather than textual responses. The framework creates 40 benchmark tasks and evaluates 11 frontier models, addressing critical gaps in how large language model agents are currently assessed.