Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success
A research paper argues that jailbreak attack evaluations should report distributional success rates across parameter configurations rather than single best-case numbers. The authors propose two new metrics—Variant Sensitivity Measure (VSM) and Union Coverage (UC)—and demonstrate that an attack with an 81% success rate in its best single configuration reaches 100% coverage when all tested variants are pooled, fundamentally changing the threat assessment.
This paper addresses a methodological gap in AI safety research with significant implications for how the security community evaluates language model vulnerabilities. Current practice in jailbreak research cherry-picks optimal parameter settings, painting a misleading picture of attack effectiveness that obscures the true threat surface. By testing PAIR and bijection attacks across multiple configurations on models such as Mistral-7B and Qwen3, the authors reveal substantial performance variance—gaps of up to 19 percentage points between the best and average configurations. This variability matters because defenders must understand not just peak attack success but the distribution of outcomes across the parameter space attackers might explore.

The Union Coverage metric makes the concern concrete: an 81% attack success rate under optimal conditions becomes 100% once all 36 tested variants are counted, suggesting that single-configuration reporting fundamentally understates attack power. This work reflects a broader trend in AI safety toward rigorous, probabilistic threat modeling rather than deterministic worst-case analysis.

The framework directly affects how security researchers prioritize defenses and how AI companies assess model robustness. If adopted as a standard, distributional reporting would likely reveal that many published defenses are less effective than claimed, potentially accelerating investment in more resilient safety mechanisms. It also signals that the AI safety evaluation ecosystem is maturing toward more sophisticated benchmarking practices, much as security research in traditional domains matured beyond simple success/failure metrics.
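The gap between single-configuration reporting and the two proposed metrics can be sketched numerically. The snippet below builds a synthetic success matrix (configurations × prompts) and computes a best-configuration ASR, a sensitivity gap, and a union coverage. The exact definitions of VSM and UC in the paper may differ; here, as an assumption for illustration, union coverage is the fraction of prompts broken by at least one variant, and sensitivity is the gap between the best configuration's ASR and the mean ASR across configurations.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical evaluation: 36 attack configurations x 100 prompts,
# True = the attack succeeded on that prompt under that configuration.
success = rng.random((36, 100)) < 0.5

per_config_asr = success.mean(axis=1)        # ASR of each configuration
best_asr = per_config_asr.max()              # what single-config reporting shows
mean_asr = per_config_asr.mean()             # average over the parameter space
union_coverage = success.any(axis=0).mean()  # prompts broken by >= 1 variant

print(f"best-config ASR : {best_asr:.2f}")
print(f"mean ASR        : {mean_asr:.2f}")
print(f"sensitivity gap : {best_asr - mean_asr:.2f}")
print(f"union coverage  : {union_coverage:.2f}")
```

Because a prompt only needs to fall to one of the 36 variants, union coverage is always at least as high as the best single configuration's ASR, which is the mechanism behind the paper's 81%-to-100% jump.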
- Current jailbreak evaluations using single-configuration metrics systematically understate attack success by ignoring parameter variation.
- Union Coverage reveals attacks reaching a 100% success rate across variants despite appearing far less effective in any single optimal configuration.
- The Variant Sensitivity Measure quantifies performance deviation, exposing how dramatically parameter choices affect attack reliability.
- Distributional reporting standards would change threat assessment frameworks across AI safety research and defense prioritization.
- Adoption of these metrics could reveal that previously published defenses are significantly less robust than single-configuration benchmarks suggest.