Are Frontier LLMs Ready for Cybersecurity? Evidence for Vertical Foundation Models from Dual-Mode Vulnerability Benchmarks
A comprehensive evaluation of frontier large language models on cybersecurity tasks finds that they produce high false positive rates (10-50%) in vulnerability detection and cover only 4-8% of ground-truth vulnerabilities in black-box testing, suggesting that specialized domain training and structured methodology matter more than model scale for security applications.
The research challenges the prevailing assumption that scaling frontier LLMs automatically solves complex domain problems like cybersecurity. Testing six top-tier models including GPT-5.4, Claude Opus, and Gemini variants exposed a critical gap: these general-purpose models generate excessive false positives while missing real vulnerabilities at scale. The black-box testing results are particularly damning, with even the most capable frontier models achieving single-digit ground-truth vulnerability coverage.
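To make "ground-truth vulnerability coverage" concrete, here is a minimal Python sketch. It assumes coverage means the share of known (ground-truth) vulnerabilities that a model's findings match; the summary does not spell out the benchmark's exact definition. The function name, the identifier format, and the sample data are hypothetical and are not taken from the benchmark.

```python
def ground_truth_coverage(ground_truth: set[str], reported: set[str]) -> float:
    """Fraction of known vulnerabilities that appear among the reported findings."""
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    return len(ground_truth & reported) / len(ground_truth)


# Hypothetical data: 25 seeded vulnerabilities. The model reports 2 real ones
# plus 3 findings that match nothing in the ground truth.
known = {f"VULN-{i:03d}" for i in range(1, 26)}
findings = {"VULN-004", "VULN-017", "FP-1", "FP-2", "FP-3"}

coverage = ground_truth_coverage(known, findings)
print(f"coverage = {coverage:.1%}")  # prints "coverage = 8.0%"
```

Under this definition, unmatched findings do not lower coverage. They show up in the false positive figures instead, which is why the two numbers are reported separately.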
This finding reflects a broader pattern in AI deployment where raw capability does not translate to reliable performance in specialized, safety-critical domains. Cybersecurity differs fundamentally from general language tasks: it requires understanding attack chains, system architecture dependencies, and context-specific threat modeling. Frontier models are trained on diverse internet data that lacks the structured, failure-heavy, sequential reasoning traces this domain requires.
The research reports that domain-specialized models significantly outperform their generalist counterparts, with a defense model achieving 90.4% precision and a 9.7% false positive rate on a single GPU. This cuts against the scaling hypothesis and suggests the industry should pursue vertical foundation models tailored to specific security use cases rather than assuming frontier LLMs can be prompt-engineered into competence.
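Precision and false positive rate are computed over different denominators, so the two figures need not sum to 100%. The sketch below illustrates the arithmetic under the standard definitions, precision = TP / (TP + FP) and false positive rate = FP / (FP + TN). Whether the benchmark uses exactly these definitions is an assumption, and the class name, function names, and counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    tp: int  # vulnerable samples flagged as vulnerable
    fp: int  # safe samples flagged as vulnerable
    tn: int  # safe samples left unflagged
    fn: int  # vulnerable samples missed


def precision(c: ConfusionCounts) -> float:
    """Share of flagged samples that are truly vulnerable: TP / (TP + FP)."""
    flagged = c.tp + c.fp
    return c.tp / flagged if flagged else 0.0


def false_positive_rate(c: ConfusionCounts) -> float:
    """Share of safe samples that were wrongly flagged: FP / (FP + TN)."""
    safe = c.fp + c.tn
    return c.fp / safe if safe else 0.0


# Hypothetical counts, chosen only to illustrate the arithmetic.
counts = ConfusionCounts(tp=90, fp=10, tn=90, fn=10)
print(f"precision = {precision(counts):.1%}")                      # prints "precision = 90.0%"
print(f"false positive rate = {false_positive_rate(counts):.1%}")  # prints "false positive rate = 10.0%"
```

Because the false positive rate depends on how many safe samples the test set contains, the same detector can show a different rate on a differently balanced dataset.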
The proposed self-play strategy for generating security testing data addresses a root cause: the scarcity of training data containing structured, end-to-end attack sequences. Organizations considering AI-powered security tools should recognize that integrating frontier LLMs without domain specialization introduces substantial risk, both from missed vulnerabilities and from alert fatigue caused by false positives.
- →Frontier LLMs produce 10-50% false positive rates in vulnerability detection, systematically over-predicting security issues
- →Black-box testing achieves only a 4-8% vulnerability detection rate, even when combined with external security tools like Burp Suite
- →Domain-specialized models substantially outperform frontier LLMs, achieving 90.4% precision versus the broader models' inconsistent results
- →Structured methodology and security-specific training data matter more than model scale for cybersecurity effectiveness
- →The shortage of failure-heavy, multi-step attack chain training data represents the fundamental bottleneck limiting current AI security tools