AI · Bullish · arXiv – CS AI · 8h ago · 7/10
CyberGym-E2E: Scalable Real-World Benchmark for AI Agents' End-to-End Cybersecurity Capabilities
Researchers introduce CyberGym-E2E, a large-scale benchmark of 920 real-world vulnerabilities that evaluates AI agents across the complete vulnerability lifecycle: discovery, proof-of-concept generation, and patch creation. By testing end-to-end remediation rather than isolated tasks, it addresses a critical gap in cybersecurity AI evaluation and aims to set a standard for measuring autonomous vulnerability management systems.