Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection
Researchers show that multi-turn prompt injection attacks leave detectable signatures in a language model's activations: analyzing residual stream trajectories yields 93.8% detection accuracy. Adversarial attack sequences exhibit a distinctive 'restlessness' pattern that appears across model architectures, although detection performance varies significantly when the method is deployed on real-world data.
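
To make the idea of a trajectory signature concrete, here is a minimal sketch of one way a 'restlessness' score could be computed. It is an illustration under stated assumptions, not the researchers' method: the summary does not define restlessness, so the sketch assumes it can be approximated by the mean cosine distance between consecutive per-turn activation vectors. The names `restlessness` and `flag_conversation`, the one-vector-per-turn input, and the 0.5 threshold are all hypothetical.

```python
import numpy as np


def restlessness(trajectory: np.ndarray) -> float:
    """Mean cosine distance between consecutive per-turn activation vectors.

    `trajectory` has shape (num_turns, hidden_dim). Each row is assumed to be
    a residual stream vector already extracted for one conversation turn.
    This definition is an assumption made for illustration only.
    """
    if trajectory.ndim != 2 or trajectory.shape[0] < 2:
        raise ValueError("need a 2-D array with at least two turns")
    prev, nxt = trajectory[:-1], trajectory[1:]
    # Product of norms for each consecutive pair, guarded against zero vectors.
    norms = np.linalg.norm(prev, axis=1) * np.linalg.norm(nxt, axis=1)
    cosine = np.sum(prev * nxt, axis=1) / np.maximum(norms, 1e-12)
    return float(np.mean(1.0 - cosine))


def flag_conversation(trajectory: np.ndarray, threshold: float = 0.5) -> bool:
    """Flag a conversation whose score exceeds an arbitrary, untuned threshold."""
    return restlessness(trajectory) > threshold


# Synthetic example: a trajectory that never moves versus one that keeps
# jumping between two orthogonal directions.
steady = np.tile(np.array([1.0, 0.0, 0.0]), (4, 1))
jumpy = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(restlessness(steady), restlessness(jumpy))
```

In practice a learned probe on these trajectories would likely replace the fixed threshold, and the reported gap on real-world data suggests any threshold chosen on curated attacks would need recalibration before deployment.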