CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities
Researchers introduce CyberGym-E2E, a large-scale benchmark with 920 real-world vulnerabilities that evaluates AI agents across the complete vulnerability lifecycle—discovery, proof-of-concept generation, and patch creation. This addresses a critical gap in cybersecurity AI evaluation by testing end-to-end remediation capabilities rather than isolated tasks, establishing a new standard for measuring autonomous vulnerability management systems.
CyberGym-E2E marks a shift in how the AI research community evaluates autonomous cybersecurity capabilities. Previous benchmarks focused on isolated tasks, such as vulnerability detection or patch generation, rather than the interconnected workflow that security teams actually perform. The new benchmark, with 920 vulnerabilities from 139 open-source projects, captures realistic scenarios where AI agents must detect flaws, understand their exploitability, generate working proofs-of-concept, and create effective patches.
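To see why end-to-end scoring is stricter than scoring each task in isolation, consider the minimal sketch below. It is an illustration only, not the benchmark's actual metric or harness: the `StageResult` fields, the function names, and the four sample outcomes are all hypothetical. It assumes a vulnerability counts as solved end to end only when discovery, proof-of-concept, and patch all succeed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageResult:
    """Hypothetical per-vulnerability outcome; field names are illustrative."""
    discovered: bool   # the agent located the flaw
    poc_works: bool    # the generated proof-of-concept triggers the flaw
    patch_works: bool  # the generated patch fixes the flaw


def isolated_rates(results):
    """Success rate of each stage, scored independently of the others."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    return {
        "discovery": sum(r.discovered for r in results) / n,
        "poc": sum(r.poc_works for r in results) / n,
        "patch": sum(r.patch_works for r in results) / n,
    }


def end_to_end_rate(results):
    """Fraction of vulnerabilities where every stage succeeded (assumed rule)."""
    if not results:
        raise ValueError("results must not be empty")
    solved = sum(
        r.discovered and r.poc_works and r.patch_works for r in results
    )
    return solved / len(results)


# Made-up outcomes for four vulnerabilities, purely for illustration.
sample = [
    StageResult(True, True, True),
    StageResult(True, True, False),
    StageResult(True, False, True),
    StageResult(False, False, False),
]

print(isolated_rates(sample))
print(end_to_end_rate(sample))
```

On this made-up sample, each stage succeeds on at least half of the cases when scored alone, yet only one case in four passes every stage. Isolated-task benchmarks cannot expose that gap.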
The development reflects growing recognition that AI's role in cybersecurity extends beyond narrow detection tasks. As enterprises increasingly explore autonomous vulnerability remediation, standardized evaluation metrics become essential for comparing AI systems and understanding their real-world applicability. The automated pipeline for transforming vulnerability data into evaluation environments also addresses scalability, a persistent challenge in cybersecurity benchmarking, where creating realistic, reproducible test cases typically requires manual effort.
For the AI and cybersecurity industries, this benchmark catalyzes progress toward autonomous security operations. Development teams and security vendors can now objectively measure their AI agent capabilities against standardized criteria, accelerating iteration cycles. Organizations evaluating AI-powered security solutions gain a reference framework for assessing vendor claims.
The broader implication extends to enterprise digital infrastructure defense. As AI agents demonstrate competency across complete vulnerability lifecycles, organizations may gradually shift from reactive incident response toward proactive, AI-driven remediation. This could materially reduce mean-time-to-patch metrics and expand security coverage to resource-constrained teams. Future versions of this benchmark with expanded vulnerability sets and diverse software ecosystems will be critical for validating generalization across different codebases and programming paradigms.
- CyberGym-E2E evaluates AI agents across the complete vulnerability lifecycle, not isolated cybersecurity tasks, establishing a more realistic testing standard.
- The benchmark contains 920 real-world vulnerabilities spanning 139 open-source projects, providing substantial scale for rigorous AI agent evaluation.
- Automated pipeline methodology enables scalable transformation of vulnerability data into evaluation environments, addressing historical bottlenecks in cybersecurity benchmarking.
- Standardized evaluation metrics accelerate development of autonomous vulnerability management systems and enable objective comparison between competing AI solutions.
- Successful AI performance on end-to-end vulnerability remediation could enable organizations to significantly reduce time-to-patch and expand security coverage.