How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency
Researchers conducted 400 autonomous penetration testing runs across four LLMs against a fixed vulnerable target to measure attack consistency. Results show wide variation in exploitation success rates (25-85%) and distinctive failure modes per model, with Claude and Gemini 2.5 Flash-Lite substantially outperforming GPT-4o-mini and Qwen, raising critical questions about LLM reliability in security-critical autonomous operations.
This empirical study addresses a fundamental gap in LLM security research: whether autonomous AI-driven attacks demonstrate consistent behavior when repeatedly targeting identical vulnerabilities. The researchers executed 100 penetration testing runs per model against OWASP Juice Shop and companion services, holding all variables constant except the LLM itself. Results reveal striking disparities in both success rates and failure modes, with Gemini 2.5 Flash-Lite achieving 85% full exploitation versus Qwen's 25%, despite identical attack objectives and target configurations.
The findings expose model-specific weaknesses in orchestration and decision-making. Claude experienced upstream API truncation (39 failures), Qwen exhibited premature completion loops (52 failures), and GPT-4o-mini exhausted its iteration budgets. Cross-service credential reuse appeared exclusively in models retaining extensive conversation history, suggesting memory management directly impacts attack sophistication. The statistically significant differences (p < 0.001) with large effect sizes indicate these variations are fundamental architectural characteristics rather than random noise.
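To make the significance claim concrete, here is a minimal sketch that compares only the two success rates reported in this summary (85 of 100 runs for Gemini 2.5 Flash-Lite, 25 of 100 for Qwen). The summary does not say which statistical test or effect-size measure the researchers used, so the Pearson chi-square test and Cohen's h below are assumptions chosen for illustration, and the function name `two_proportion_summary` is hypothetical.

```python
import math


def two_proportion_summary(success_a, n_a, success_b, n_b):
    """Pearson chi-square (df=1, no continuity correction) and Cohen's h
    for two independent success rates. Illustrative only: the source does
    not state which test the study actually applied."""
    fail_a = n_a - success_a
    fail_b = n_b - success_b
    n = n_a + n_b

    # Closed form for a 2x2 table: n * (ad - bc)^2 / (product of margins)
    numerator = n * (success_a * fail_b - fail_a * success_b) ** 2
    denominator = n_a * n_b * (success_a + success_b) * (fail_a + fail_b)
    chi2 = numerator / denominator

    # With one degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))

    # Cohen's h: difference of arcsine-transformed proportions.
    p_a = success_a / n_a
    p_b = success_b / n_b
    h = 2 * math.asin(math.sqrt(p_a)) - 2 * math.asin(math.sqrt(p_b))
    return chi2, p_value, h


# Success counts taken from the summary: Gemini 85/100, Qwen 25/100.
chi2, p_value, h = two_proportion_summary(85, 100, 25, 100)
print(f"chi2={chi2:.2f}, p={p_value:.2e}, h={h:.2f}")
```

Under these assumptions the chi-square statistic is about 72.7 and Cohen's h is about 1.30, which is above the conventional 0.8 threshold for a large effect, and the p-value falls far below 0.001. This covers only one pair of models; a full analysis would include all four.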
For the security and AI development communities, this work demonstrates that autonomous LLM attack behavior remains unreliable and unpredictable at current capability levels. Organizations deploying LLMs for security-critical functions face uncertain outcomes: the gap between a model that succeeds in 85% of attempts and one that succeeds in 25% makes operational risk assessment difficult. The 15-30 second exploitation window suggests that attack speed remains consistent even when success rates diverge dramatically.
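For risk assessment, it helps to attach an uncertainty range to each observed success rate. The sketch below computes a 95% Wilson score interval for the two rates quoted above, each over 100 runs. The choice of the Wilson interval and the helper name `wilson_interval` are assumptions made for illustration; the summary does not report confidence intervals.

```python
import math


def wilson_interval(successes, runs, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / runs
    denom = 1 + z**2 / runs
    center = (p_hat + z**2 / (2 * runs)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / runs + z**2 / (4 * runs**2)) / denom
    )
    return center - half_width, center + half_width


# Success counts taken from the summary: Gemini 85/100, Qwen 25/100.
gemini_low, gemini_high = wilson_interval(85, 100)
qwen_low, qwen_high = wilson_interval(25, 100)
print(f"Gemini 2.5 Flash-Lite: {gemini_low:.3f} to {gemini_high:.3f}")
print(f"Qwen: {qwen_low:.3f} to {qwen_high:.3f}")
```

With 100 runs per model, the intervals come out at roughly 0.77 to 0.91 and 0.18 to 0.34. They do not overlap, which is consistent with the gap being more than sampling noise.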
Looking forward, this research should prompt development of robust LLM orchestration frameworks, improved error recovery mechanisms, and standardized benchmarking protocols for autonomous security operations. Understanding why certain models fail in specific ways enables targeted improvements, but current inconsistency argues against deploying untested LLM agents in production security environments.
- Gemini 2.5 Flash-Lite achieved 85% exploitation success while Qwen managed only 25%, revealing substantial model-dependent reliability gaps in autonomous attacks.
- Each LLM exhibits distinctive failure modes: Claude through API truncation, Qwen through premature loops, and GPT-4o-mini through iteration exhaustion.
- No model sustained a content refusal beyond a single re-prompt, indicating that current safety guardrails are ineffective in orchestrated attack scenarios.
- Success rate differences across models are statistically significant with large effect sizes, confirming these are fundamental architectural characteristics rather than random variation.
- Credential reuse behavior correlates directly with conversation history retention, suggesting memory management architecture significantly influences attack sophistication.