AI · Neutral · arXiv – CS AI · 14h ago · 6/10
Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures
Researchers introduce Temporal Logit Observability (TLO), a training-free diagnostic that analyzes logit patterns during decoding to show how LLM jailbreak attacks unfold over time, not just whether they succeed. The method finds that attacks with identical success rates follow different failure pathways, which supports finer-grained safety evaluation and early-stopping defenses that reduce successful jailbreaks by over 50%.
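
The summary does not say which logit signal TLO tracks, so the following is only a rough sketch of the general idea, not the authors' method. It assumes a hypothetical signal (the probability mass on a hand-picked set of refusal token ids at each decoding step) and a hypothetical stopping rule (halt once that mass stays below a threshold for a few consecutive steps). The names `refusal_mass_trace` and `early_stop_step`, and the `threshold` and `patience` parameters, are illustrative.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def refusal_mass_trace(step_logits, refusal_ids):
    """Per-step probability mass on the refusal token ids.

    ASSUMPTION: this signal is illustrative; the source does not
    specify which logit statistic TLO actually observes.
    """
    trace = []
    for logits in step_logits:
        probs = softmax(logits)
        trace.append(sum(probs[i] for i in refusal_ids))
    return trace


def early_stop_step(trace, threshold=0.2, patience=2):
    """Return the first step index at which the refusal mass has been
    below `threshold` for `patience` consecutive steps, else None.

    ASSUMPTION: a hypothetical stopping rule, not the paper's defense.
    """
    run = 0
    for t, mass in enumerate(trace):
        run = run + 1 if mass < threshold else 0
        if run >= patience:
            return t
    return None


# Toy 3-token vocabulary where token 0 stands in for a refusal token.
toy_logits = [
    [2.0, 0.0, 0.0],   # refusal token dominates
    [0.0, 0.0, 0.0],   # refusal mass falls to 1/3
    [-2.0, 0.0, 0.0],  # refusal mass collapses
    [-2.0, 0.0, 0.0],  # and stays low
]
toy_trace = refusal_mass_trace(toy_logits, refusal_ids={0})
stop_at = early_stop_step(toy_trace)
```

Two attacks that both end in a harmful completion could produce quite different traces under a signal like this (a gradual decline versus an abrupt collapse), which is the kind of distinction a single success-rate number cannot show.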