AINeutralarXiv – CS AI · 18h ago6/10
🧠
Beyond Pass/Fail: Using Process Mining to Understand How LLMs Resist (and Fail) Red Team Attacks
Researchers applied process mining techniques to red team attack logs against large language models, revealing that standard attack success rate metrics mask critical differences in how models defend themselves. GPT-OSS 120B exhibits a near-absorbing refusal state, while Llama 3.3 70B shows multiple escape routes from refusal, with mutator effectiveness varying significantly across models.
🧠 Llama