SALLIE: Safeguarding Against Latent Language & Image Exploits
Researchers introduce SALLIE, a lightweight runtime defense framework that simultaneously detects and mitigates jailbreak attacks and prompt injections in large language models and vision-language models. Using mechanistic interpretability and internal model activations, SALLIE achieves robust protection across multiple architectures without degrading performance or requiring architectural changes.
The vulnerability of large language models and vision-language models to adversarial attacks represents a critical barrier to safe AI deployment in production environments. Current defense mechanisms typically operate through cumbersome input transformations that sacrifice model performance or address textual and visual threats independently, leaving systems incompletely protected. SALLIE addresses this fundamental gap by implementing a unified, runtime detection approach that operates at the inference stage without modifying model architecture or requiring retraining.
The framework leverages mechanistic interpretability, a growing field that explains AI decision-making through internal model components, to extract meaningful threat signals directly from residual-stream activations. A three-stage pipeline extracts these activations, applies layer-wise k-nearest-neighbor classifiers to flag malicious patterns, and aggregates predictions across multiple layers for robust detection. The researchers tested SALLIE on practical, resource-efficient models like Phi-3.5-vision and SmolVLM2, reflecting real-world deployment constraints where computational budgets remain tight.
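To make the layer-wise detection idea concrete, the sketch below shows the general pattern: classify an input's per-layer activation vector with a k-nearest-neighbor classifier against a small labeled calibration set, then majority-vote across layers. This is a minimal illustration under assumed details; the function names, the toy 2-D "activations," and the voting rule are hypothetical and do not reproduce SALLIE's actual features or aggregation scheme.

```python
# Minimal sketch: per-layer k-NN over residual-stream activations, then a
# majority vote across layers. All data and thresholds here are toy values.
import math
from collections import Counter

def knn_predict(train, labels, x, k=3):
    """Label a query activation vector by its k nearest calibration vectors."""
    dists = sorted((math.dist(v, x), y) for v, y in zip(train, labels))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

def detect(per_layer_train, per_layer_labels, per_layer_query, k=3):
    """Run one k-NN classifier per layer, then majority-vote the layer verdicts."""
    layer_preds = [
        knn_predict(tr, la, q, k)
        for tr, la, q in zip(per_layer_train, per_layer_labels, per_layer_query)
    ]
    return Counter(layer_preds).most_common(1)[0][0], layer_preds

# Toy calibration set: 2-D "activations" per layer, labeled benign (0) / malicious (1)
train = [
    [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 1.1]],  # layer 1
    [[0.0, 0.2], [0.2, 0.0], [1.1, 0.9], [1.0, 1.0]],  # layer 2
    [[0.1, 0.0], [0.0, 0.1], [0.9, 0.9], [1.2, 1.0]],  # layer 3
]
labels = [[0, 0, 1, 1]] * 3
query = [[0.95, 1.0], [1.0, 1.05], [0.05, 0.0]]  # one input's per-layer activations

verdict, per_layer = detect(train, labels, query, k=3)
# Layers 1 and 2 vote malicious, layer 3 votes benign; the aggregate is malicious.
```

In a real system the calibration vectors would be hidden states captured at inference time from known benign and adversarial prompts, and the cross-layer aggregation could weight layers by their individual reliability rather than voting uniformly.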
For developers and enterprises deploying multimodal AI systems, this advancement offers a practical security layer that maintains inference speed while defending against increasingly sophisticated attacks. The comprehensive evaluation across ten datasets and comparison against five baseline methods demonstrates SALLIE's effectiveness. The approach's compatibility with standard token-level fusion pipelines means adoption barriers remain low for existing model implementations. As AI systems face mounting adversarial pressure, runtime detection frameworks that preserve performance while strengthening robustness become essential infrastructure rather than optional enhancements.
- SALLIE provides unified defense against both textual and visual jailbreaks without performance degradation or architectural modifications
- The framework uses mechanistic interpretability to extract threat signals from model internal activations during inference
- Testing on efficient models like Phi-3.5-vision and SmolVLM2 prioritizes practical deployment and real-world inference costs
- Comprehensive evaluation across ten datasets shows SALLIE consistently outperforms five baseline defense methods
- Runtime detection integrates seamlessly into existing token-level fusion pipelines with minimal adoption friction