Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection
Researchers propose SALO, a jailbreak detection method that identifies persistent 'refusal trajectories' across model layers, rather than relying on static terminal representations. The detector demonstrates improved detection rates against adversarial attacks on multiple LLM architectures, though with acknowledged limitations against adaptive attacks.
This research addresses a weakness in how jailbreaks against language models are currently detected. Traditional refusal detection methods examine final-layer representations, assuming they capture the model's decision to refuse harmful requests. However, this work shows that sophisticated attacks like GCG (Greedy Coordinate Gradient) can suppress these terminal signals while leaving upstream activation patterns intact. The authors call these persistent upstream patterns 'refusal trajectories.' The implication for AI safety is that current jailbreak detection mechanisms may provide false confidence, because they miss refusal pathways that an attack has suppressed only at the final layer.

SALO instead monitors sparse activation patterns across layer-token positions, which gives a more granular view of how models internally construct refusal decisions. The detector shows consistent improvements across Qwen, Llama, and Mistral models, suggesting the findings generalize across architecture families. The authors also document its limitations, including reduced effectiveness against adaptive GCG attacks and difficulty with encoded inputs.

The work bears directly on deployment security: organizations relying on static representation-based detection could be exposed to attacks that suppress terminal refusal signals. For developers and safety teams, it signals that single-layer monitoring is insufficient and that multi-layered, trajectory-based approaches may better protect production systems. The research offers a new baseline for jailbreak detection work while highlighting the ongoing arms race between attack and defense mechanisms in AI safety.
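The contrast between terminal-layer detection and trajectory-based detection can be illustrated with a toy sketch. This is not SALO's implementation: the score grid (a hypothetical per-layer, per-token "refusal score"), the threshold, the `min_active` count, and both function names are assumptions made purely for illustration.

```python
from typing import List

# scores[layer][token]: a hypothetical refusal score for each
# (layer, token) position, e.g. a projection onto some refusal direction.
Scores = List[List[float]]


def terminal_detector(scores: Scores, threshold: float = 0.5) -> bool:
    """Flag the prompt only if the final layer's last-token score is high."""
    return scores[-1][-1] > threshold


def trajectory_detector(
    scores: Scores, threshold: float = 0.5, min_active: int = 3
) -> bool:
    """Flag the prompt if enough positions anywhere in the grid are high."""
    active = sum(1 for layer in scores for s in layer if s > threshold)
    return active >= min_active


# Toy data (invented): the refusal signal is strong in the middle layers
# but has been suppressed at the final layer, as an attack might do.
suppressed = [
    [0.1, 0.1, 0.2],
    [0.2, 0.7, 0.8],
    [0.3, 0.9, 0.6],
    [0.1, 0.2, 0.1],
]

# Toy data (invented): a benign prompt with no refusal signal anywhere.
benign = [
    [0.1, 0.0, 0.1],
    [0.2, 0.1, 0.0],
    [0.1, 0.2, 0.1],
    [0.0, 0.1, 0.1],
]

print("terminal, suppressed:  ", terminal_detector(suppressed))
print("trajectory, suppressed:", trajectory_detector(suppressed))
print("trajectory, benign:    ", trajectory_detector(benign))
```

In this toy setup the terminal check misses the suppressed example because it reads a single position, while the grid-wide count still sees the mid-layer activity. A real detector would need scores extracted from an actual model and thresholds chosen on held-out data.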
- →SALO identifies persistent refusal patterns across model layers that survive attacks suppressing terminal refusal signals
- →Current static representation-based detection methods may miss sophisticated jailbreak attempts
- →The detector improves detection rates across Qwen, Llama, and Mistral models under standardized testing conditions
- →Adaptive attacks can reduce SALO's effectiveness, indicating ongoing adversarial evolution
- →Multi-layer activation monitoring may provide better safety assurance than terminal-layer-only approaches