Evaluating LLMs for Real-World Web Vulnerability Detection
Researchers benchmarked six large language models on their ability to detect real-world web vulnerabilities in WordPress plugins, finding that while all models can identify security issues, detection rates vary significantly (35-63%) and no model maintains consistent results across repeated tests. The findings reveal both the promise and critical limitations of LLM-based vulnerability detection for security practitioners.
The study addresses a critical gap in understanding LLM capabilities for cybersecurity applications. As organizations increasingly consider deploying AI tools for vulnerability detection, this research provides empirical evidence that even the best performer, Claude Opus 4.6, detected only 63% of vulnerabilities, and that reliability remains a problem. No model reported consistently across iterations, with consistency as low as 50% for some, which raises serious concerns about using LLMs as primary security tools without human oversight.
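To make the two headline numbers concrete, here is a minimal sketch of how a detection rate and a run-to-run consistency score could be computed. The summary does not give the study's exact metric definitions, so both functions below are assumptions: detection rate is taken as the share of baseline vulnerabilities reported in a single run, and consistency as the share of baseline vulnerabilities with the same outcome (reported in every run, or in none) across runs. The vulnerability IDs and run data are made up for illustration.

```python
from typing import List, Set


def detection_rate(found: Set[str], baseline: Set[str]) -> float:
    """Share of baseline vulnerabilities reported in a single run (assumed definition)."""
    if not baseline:
        raise ValueError("baseline must not be empty")
    return len(found & baseline) / len(baseline)


def consistency(runs: List[Set[str]], baseline: Set[str]) -> float:
    """Share of baseline vulnerabilities with an identical outcome in every run.

    A vulnerability counts as consistent if it was reported in all runs
    or in none of them (assumed definition, not taken from the study).
    """
    if not baseline or not runs:
        raise ValueError("baseline and runs must not be empty")
    stable = 0
    for vuln in baseline:
        hits = sum(vuln in run for run in runs)
        if hits == 0 or hits == len(runs):
            stable += 1
    return stable / len(baseline)


# Hypothetical data: four known vulnerabilities, three repeated runs of one model.
baseline = {"V1", "V2", "V3", "V4"}
runs = [{"V1", "V2"}, {"V1", "V3"}, {"V1", "V2"}]

rates = [detection_rate(run, baseline) for run in runs]
score = consistency(runs, baseline)
print(rates, score)
```

In this toy data every run finds half of the baseline, yet only V1 (always found) and V4 (never found) behave the same way each time, so the consistency score is also one half. A model can therefore post a stable-looking detection rate while reporting different vulnerabilities on each run.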
This research reflects the broader trend of evaluating frontier AI models against specialized security tasks. Unlike generic language benchmarks, vulnerability detection demands both code comprehension and deep security domain knowledge. The finding that scoped prompts outperform open-ended ones suggests LLMs benefit from constraint, but the modest size of the improvement points to fundamental limitations that prompt engineering alone is unlikely to resolve.
For security practitioners and enterprises, these results carry significant implications. Organizations considering AI-assisted vulnerability scanning must recognize that even top-performing models miss critical vulnerabilities and produce inconsistent outputs. This inconsistency could create false confidence, leading to overlooked security issues. The gap between frontier and open-weight models (63% vs 35-48%) also highlights the trade-offs between capability and deployment flexibility.
Looking ahead, the field should focus on understanding why consistency fails and on developing hybrid approaches that combine LLM strengths with traditional static analysis tools. The publication of code and data will likely spawn follow-up research examining specific vulnerability types and exploring ensemble methods. For now, security practitioners should treat LLM-based detection as a supplementary tool rather than a primary mechanism.
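One way to picture such a hybrid setup is a triage step that votes across repeated LLM runs and cross-checks the result against a static analyzer. The sketch below is illustrative only and is not described in the study: the finding identifiers, the `min_votes` threshold, and the bucket names are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set


def majority_vote(runs: List[Set[str]], min_votes: int) -> Set[str]:
    """Keep findings reported in at least `min_votes` of the repeated LLM runs."""
    counts = Counter(finding for run in runs for finding in run)
    return {finding for finding, n in counts.items() if n >= min_votes}


def triage(llm_runs: List[Set[str]], static_findings: Set[str],
           min_votes: int = 2) -> Dict[str, Set[str]]:
    """Sort findings into review buckets (hypothetical scheme)."""
    all_llm: Set[str] = set().union(*llm_runs) if llm_runs else set()
    stable_llm = majority_vote(llm_runs, min_votes)
    return {
        # Reported by the static analyzer and by at least one LLM run.
        "confirmed": static_findings & all_llm,
        # Reported only by the static analyzer.
        "static_only": static_findings - all_llm,
        # Reported repeatedly by the LLM but not by the static analyzer.
        "llm_stable": stable_llm - static_findings,
        # Reported by too few LLM runs and not by the static analyzer.
        "llm_unstable": all_llm - stable_llm - static_findings,
    }


# Hypothetical findings encoded as "type:file".
llm_runs = [
    {"sqli:a.php", "xss:b.php"},
    {"sqli:a.php"},
    {"sqli:a.php", "csrf:c.php", "xss:b.php"},
]
static_findings = {"sqli:a.php", "idor:d.php"}

buckets = triage(llm_runs, static_findings)
for name, items in buckets.items():
    print(name, sorted(items))
```

Every finding lands in exactly one bucket, so a human reviewer can start with the confirmed and static-only items and treat the unstable LLM-only items as low-priority leads rather than as verified results.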
- Claude Opus 4.6 achieved the highest detection rate at 63%, but all models showed significant inconsistency across repeated tests.
- Scoped prompts that target a narrow set of vulnerabilities outperformed open-ended prompts, while prompt complexity had minimal impact on results.
- No model maintained consistent reporting across three experiment iterations, with consistency rates as low as 50% for some.
- MiniMax M2.5, an open-weight model, reached 48%, behind the 63% of Claude Opus 4.6, but open-weight models may offer cost and deployment-flexibility advantages.
- LLMs demonstrated fundamental limitations, failing to detect at least one baseline vulnerability in real-world plugins.