AI · Neutral · arXiv – CS AI · 10h ago · 6/10
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
Researchers present a new evaluation protocol for AI pentesting agents that moves beyond simplified benchmarks to assess real-world vulnerability discovery capabilities. The framework combines structured ground-truth validation with LLM-based semantic matching and includes efficiency metrics, addressing a critical gap in how offensive security AI systems are currently measured.