The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring
Researchers present a comprehensive framework for systematically generating, categorizing, and evaluating jailbreak attacks against large language models. The work introduces a dataset of 114,000 adversarial prompts, automated generation methods, and a novel continuous evaluation metric (OPTIMUS) that goes beyond binary success-rate measurements.

