AI · Bearish · arXiv - CS AI · 9h ago · 7/10
EvoClaw: Evaluating AI Agents on Continuous Software Evolution
Researchers introduce EvoClaw, a benchmark that evaluates AI agents on continuous software evolution rather than isolated coding tasks. Across 12 frontier models, performance falls from above 80% on isolated tasks to at most 38% in continuous settings, highlighting how much AI agents struggle with long-term software maintenance.