CVE-Factory: Scaling Expert-Level Agentic Tasks for Code Security Vulnerability
CVE-Factory is an automated multi-agent framework that transforms vulnerability metadata into executable security tasks of expert-level quality, achieving 95% solution correctness. It also enables the creation of LiveCVEBench, a continuously updated benchmark of 190 security tasks across 14 programming languages for evaluating AI code security.
CVE-Factory addresses a critical gap in AI security research by automating the labor-intensive process of converting sparse CVE (Common Vulnerabilities and Exposures) data into high-quality, reproducible security tasks. Traditionally, vulnerability reproduction requires extensive manual effort from domain experts, creating bottlenecks that prevent systematic evaluation of code security agents. The framework's achievement of 95% solution correctness and 96% environment fidelity demonstrates that automation can match human expert standards while dramatically reducing resource requirements.
The security AI landscape has struggled with outdated and limited evaluation datasets that don't reflect emerging threat vectors. LiveCVEBench addresses this by maintaining a continuously updated corpus of 190 real-world vulnerabilities spanning 14 languages and 153 repositories, including novel AI-tooling vulnerabilities. Because threats evolve faster than static benchmarks can adapt, continuous updates help keep the evaluation relevant to the security challenges developers actually face.
The practical implications extend beyond research metrics. By synthesizing over 1,000 executable training environments, CVE-Factory makes large-scale model training for code security practical where it previously was not. The results are notable: a fine-tuned Qwen3-32B scored 35.8% on LiveCVEBench, surpassing Claude 4.5 Sonnet, and generalized to other benchmarks. This suggests that specialized, vulnerability-focused training can let a model outperform general-purpose models on security tasks, which may inform investment priorities for security-focused AI development.
Looking forward, the open-sourcing of CVE-Factory, its training data, and the leaderboard establishes a new baseline for security AI research. The continuous updating mechanism positions this work to track emerging vulnerabilities in real time, making it a valuable resource for both researchers and practitioners seeking to benchmark and improve code security capabilities across diverse programming ecosystems.
- CVE-Factory automates vulnerability task creation with 95% expert-level correctness, eliminating manual reproduction bottlenecks.
- LiveCVEBench provides 190 continuously updated vulnerability tasks across 14 languages, capturing emerging security threats including AI-tooling exploits.
- A fine-tuned Qwen3-32B scored 35.8% on LiveCVEBench, exceeding Claude 4.5 Sonnet on code security vulnerability tasks.
- The framework synthesizes over 1,000 executable training environments, enabling agentic security training at a scale that was previously infeasible.
- Open-source release of all resources establishes a new research standard for benchmarking and advancing AI code security capabilities.