AI · Neutral · arXiv - CS AI · 7h ago · 6/10
Capture the Flags: Family-Based Evaluation of Agentic LLMs via Semantics-Preserving Transformations
Researchers introduce Evolve-CTF, a tool that generates families of semantically equivalent cybersecurity challenges to evaluate the robustness of agentic LLMs. Tests of 13 LLM configurations show that the models are resilient to basic code transformations but struggle with obfuscation and composed modifications. The work offers a new benchmarking methodology for AI safety evaluation.