AINeutral · arXiv - CS AI · 5h ago
SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration
Researchers introduce SWE-CI, a new benchmark that evaluates AI agents' ability to maintain codebases over time through continuous integration processes. Unlike existing static bug-fixing benchmarks, SWE-CI tests agents on 100 long-term maintenance tasks, each spanning an average of 233 days and 71 commits.