PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits
Researchers introduce PLAGUE, a framework for conducting multi-turn jailbreak attacks on Large Language Models (LLMs) through a three-phase approach (Primer, Planner, Finisher). The framework achieves unprecedented attack success rates of 81.4% on OpenAI's o3 and 67.3% on Anthropic's Claude Opus 4.1, exposing significant vulnerabilities in models considered highly resistant to jailbreaking.
PLAGUE represents a methodological advance in adversarial AI research. It addresses a critical gap in multi-turn attack strategies, where harmful intent accumulates gradually across conversation turns rather than appearing in a single prompt. Single-turn jailbreak research has proven insufficient for understanding vulnerabilities in agentic workflows, in which LLMs engage in extended dialogues to complete complex tasks. The framework divides the attack lifecycle into discrete phases, letting researchers build attack sophistication incrementally while optimizing how context and information flow from one turn to the next.
The security implications extend beyond academic interest. As LLMs become embedded in autonomous agents that handle sensitive operations, from financial transactions to system administration, susceptibility to multi-turn exploitation poses genuine risks. The reported 30% improvement in attack success rates over existing methods, combined with breakthrough results against previously robust models, indicates that current safety measures do not adequately address conversational attack vectors.
For the AI industry, these findings create tension between capability expansion and safety assurance. Developers deploying multi-turn systems must confront the fact that state-of-the-art models remain vulnerable despite substantial safety investments. The research provides critical red-teaming insights that responsible AI organizations should integrate into their security protocols. However, the framework's accessibility as a plug-and-play tool raises concerns about its potential for weaponization.
Looking ahead, the security community faces pressure to develop defensive mechanisms designed specifically for multi-turn interactions. The research suggests that incremental safety improvements may prove insufficient and that fundamental architectural changes may be necessary. Regulatory bodies evaluating AI deployment will likely scrutinize multi-turn robustness, potentially influencing model certification standards and enterprise adoption timelines.
- PLAGUE achieves 81.4% attack success on OpenAI's o3 and 67.3% on Claude Opus 4.1, breaching safety barriers previously assumed to be robust
- Multi-turn jailbreaking remains severely understudied despite the prevalence of agentic LLM workflows in production systems
- The framework's three-phase approach (Primer, Planner, Finisher) systematizes attack design and enables a 30% improvement over existing methods
- Current LLM safety measures inadequately address conversational attack vectors in which harmful intent accumulates incrementally
- Results indicate that fundamental architectural changes may be necessary to defend against sophisticated multi-turn exploitation strategies