Beyond Pass/Fail: Using Process Mining to Understand How LLMs Resist (and Fail) Red Team Attacks
Researchers applied process mining techniques to red team attack logs against large language models, revealing that standard attack success rate metrics mask critical differences in how models defend themselves. GPT-OSS 120B exhibits a near-absorbing refusal state, while Llama 3.3 70B shows multiple escape routes from refusal, with mutator effectiveness varying significantly across models.
This research addresses a fundamental gap in AI safety evaluation methodology. While red teaming has become standard practice for assessing LLM robustness, reducing results to a single binary metric—attack success rate—obscures the underlying mechanisms of model behavior during adversarial interactions. By applying process mining to 8,575 scored events from 60 different attacks, the researchers uncovered architectural and training-induced differences that a pass/fail framework completely misses.
The findings demonstrate that LLMs employ structurally distinct defense strategies. GPT-OSS enters a near-absorbing refusal state, suggesting its training created a powerful but potentially brittle safety mechanism—once activated, attacks struggle to escape this state. Conversely, Llama 3.3 maintains multiple pathways from refusal back to compliance, indicating either weaker safety anchoring or more flexible model behavior that adversaries can exploit more readily. These differences matter because they reveal not just whether attacks succeed, but how and why they succeed or fail.
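To make the idea of a "near-absorbing" refusal state concrete, here is a minimal sketch of how transition probabilities could be estimated from attack traces. It is an illustration under stated assumptions, not the researchers' pipeline: the state labels (`refusal`, `partial`, `compliance`), the function name `transition_probabilities`, and both sets of traces are hypothetical and invented for this example.

```python
from collections import Counter, defaultdict


def transition_probabilities(traces):
    """Estimate first-order transition probabilities from state sequences.

    Each trace is a list of state labels, one per scored event, in order.
    """
    counts = defaultdict(Counter)
    for trace in traces:
        # Count every consecutive (current, next) pair in the trace.
        for current, nxt in zip(trace, trace[1:]):
            counts[current][nxt] += 1
    return {
        state: {nxt: n / sum(nexts.values()) for nxt, n in nexts.items()}
        for state, nexts in counts.items()
    }


# Hypothetical traces, invented for illustration only (not data from the study).
sticky_traces = [
    ["refusal", "refusal", "refusal", "refusal"],
    ["partial", "refusal", "refusal", "refusal"],
]
leaky_traces = [
    ["refusal", "partial", "compliance"],
    ["refusal", "refusal", "compliance"],
]

sticky = transition_probabilities(sticky_traces)
leaky = transition_probabilities(leaky_traces)

# A self-transition probability close to 1 means the state is near-absorbing.
print("sticky refusal -> refusal:", sticky["refusal"]["refusal"])
# Several outgoing transitions from refusal correspond to multiple escape routes.
print("leaky refusal ->", leaky["refusal"])
```

In this toy data the first model never leaves refusal once it enters it, while the second leaves refusal by three different transitions. Both could still report similar attack success rates, which is the gap the process-level view is meant to expose.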
For the AI safety community, this work provides tools to profile model defenses more granularly and identify which safety approaches prove most resilient under sustained attack. The asymmetric mutator effectiveness and order-of-magnitude differences in time-to-jailbreak suggest that attack patterns are model-specific rather than universal. This has direct implications for developers deploying these models: understanding a model's specific vulnerability profile enables targeted defense improvements rather than generic hardening.
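A time-to-jailbreak comparison can be sketched in a few lines. The version below assumes time is measured in turns until the first successful event, which is one possible reading; the source does not say whether the study used turns, events, or wall-clock time. The function names, the `compliance` label, and the traces are hypothetical.

```python
from statistics import median


def turns_to_jailbreak(trace, success_state="compliance"):
    """Return the 1-based turn of the first success event, or None if never reached."""
    for turn, state in enumerate(trace, start=1):
        if state == success_state:
            return turn
    return None


def summarize(traces):
    """Summarize how many attacks succeeded and how long the successes took."""
    times = [t for t in map(turns_to_jailbreak, traces) if t is not None]
    return {
        "attacks": len(traces),
        "jailbroken": len(times),
        "median_turns": median(times) if times else None,
    }


# Hypothetical traces, invented for illustration only (not data from the study).
model_a_traces = [
    ["refusal"] * 30 + ["compliance"],
    ["refusal"] * 5,
]
model_b_traces = [
    ["refusal", "partial", "compliance"],
    ["refusal", "compliance"],
    ["compliance"],
]

summary_a = summarize(model_a_traces)
summary_b = summarize(model_b_traces)
print("model A:", summary_a)
print("model B:", summary_b)
```

Attacks that never succeed are excluded from the median rather than counted as zero, so the summary reports the success count alongside it. A fuller analysis would treat those attacks as censored observations.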
Future red teaming evaluations should adopt process-level analysis to move beyond binary outcomes. Regulatory frameworks assessing AI safety would benefit from similar structural insights, as they currently lack the nuance to differentiate between models with fundamentally different defense characteristics.
- Process mining reveals structurally distinct defense profiles invisible to standard attack success rate metrics.
- GPT-OSS exhibits a near-absorbing refusal state while Llama 3.3 maintains multiple escape routes from refusal.
- Attack mutator effectiveness varies asymmetrically across models, suggesting vulnerability profiles are model-specific.
- Time-to-jailbreak distributions differ by an order of magnitude between tested models.
- Current red teaming evaluation methods inadequately characterize how LLMs defend against sequential adversarial campaigns.