REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak
Researchers introduce Reflector, a two-stage framework that enhances LLM safety by embedding self-reflection directly into the generation process rather than relying on surface-level alignment. The method achieves defense rates above 90% against sophisticated multi-step jailbreak attacks while also improving performance on math benchmarks by 5.85%.
Reflector addresses a critical vulnerability in contemporary large language models: their susceptibility to indirect jailbreak attacks, which manipulate the step-by-step generation process instead of confronting surface-level safety mechanisms directly. Traditional safety alignment approaches operate as external guardrails, leaving models exposed to adversaries who understand how LLMs build their outputs one step at a time. This research represents a meaningful shift toward proactive, internalized safety architecture.
The framework operates through two complementary stages. First, teacher-guided generation creates high-quality reflection data used for supervised fine-tuning, establishing structured patterns where models learn to self-examine their reasoning. Second, reinforcement learning with outcome-driven rewards trains models to autonomously detect and reject problematic trajectories during generation itself. This internalization approach fundamentally differs from post-hoc safety filtering.
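The idea of rejecting a problematic trajectory mid-generation and scoring only the final outcome can be illustrated with a toy sketch. Everything here is an assumption for illustration: the summary does not give Reflector's actual reward function or data format, so `Trajectory`, `generate_with_reflection`, `outcome_reward`, and `keyword_check` are hypothetical names, and the keyword check is a stand-in for reflection that the trained model would perform itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    steps: List[str]  # generated steps, possibly ending in a refusal
    refused: bool     # True if generation was stopped by reflection


def generate_with_reflection(
    proposed_steps: List[str],
    is_unsafe: Callable[[List[str]], bool],
) -> Trajectory:
    """Emit steps one at a time, reflecting on the partial trajectory after each.

    `is_unsafe` is a hypothetical stand-in for the model's own reflection.
    """
    steps: List[str] = []
    for step in proposed_steps:
        steps.append(step)
        if is_unsafe(steps):
            # Reflection flags the trajectory: drop the step and refuse.
            steps.pop()
            steps.append("[reflection] This trajectory looks unsafe; refusing.")
            return Trajectory(steps, refused=True)
    return Trajectory(steps, refused=False)


def outcome_reward(traj: Trajectory, prompt_is_harmful: bool) -> float:
    """Toy outcome-driven reward: +1 if the final behavior is the desired one.

    Refusing a harmful prompt or completing a benign one earns +1;
    complying with a harmful prompt or over-refusing earns -1.
    """
    return 1.0 if traj.refused == prompt_is_harmful else -1.0


def keyword_check(steps: List[str]) -> bool:
    """Hypothetical, deliberately crude safety check used only for this demo."""
    return any("bypass the lock" in s.lower() for s in steps)


harmful = generate_with_reflection(
    ["Gather materials", "Bypass the lock"], keyword_check
)
benign = generate_with_reflection(
    ["Define the variables", "Solve for x"], keyword_check
)
print(harmful.refused, outcome_reward(harmful, prompt_is_harmful=True))
print(benign.refused, outcome_reward(benign, prompt_is_harmful=False))
```

The reward depends only on how the trajectory ends, not on any per-step label, which is the sense of "outcome-driven" assumed in this sketch. In an actual training setup the reflection would be produced by the model and the reward would feed a reinforcement learning algorithm, neither of which is modeled here.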
The empirical results suggest substantial practical benefits. Defense success rates exceeding 90% against complex indirect attacks indicate robust performance across diverse threat scenarios. Critically, the safety improvements do not compromise utility: the framework actually improves performance on knowledge-intensive tasks like mathematics, which runs counter to the common assumption that safety measures reduce model capability.
For the AI development community, this work signals that trajectory-level safety mechanisms may offer scalable alternatives to computationally expensive safety approaches. The method's generalization across different attack types suggests applicability beyond narrow threat models. As LLM deployment expands into sensitive domains, internalized safety mechanisms that preserve capability while defending against sophisticated attacks become increasingly valuable. Future research will likely focus on scaling this approach to larger models and on understanding how reflection patterns transfer across domains.
- Reflector embeds self-reflection into the LLM generation process itself rather than relying on external safety filters
- The framework achieves 90%+ defense rates against sophisticated multi-step jailbreak attacks across diverse scenarios
- Safety improvements correlate with utility gains, including 5.85% performance improvement on mathematical reasoning benchmarks
- Two-stage training combines supervised fine-tuning for structured reflection patterns with reinforcement learning for autonomous safety
- This internalized approach addresses fundamental limitations of surface-level alignment without significant computational overhead