AI · Bullish · arXiv – CS AI · 4d ago · 7/10
🧠SafeSpec is a new speculative inference framework that integrates safety guardrails directly into LLM decoding acceleration without sacrificing speed gains. The method uses a lightweight safety head to detect unsafe outputs and applies reflective sampling to recover safe continuations, achieving a 15% reduction in attack success rates while maintaining a 2.06x speedup on standard workloads.
AI · Bearish · arXiv – CS AI · 4d ago · 7/10
🧠Researchers introduced NRT-Bench, a multi-turn red-teaming benchmark that tests LLM agents in a simulated nuclear power plant control room. Adaptive adversarial attacks compromised critical safety functions in 8.7% to 12.1% of sessions across four frontier models. The vulnerabilities were distributed unevenly across models rather than shared, raising concerns about LLM reliability in safety-critical deployments.
AI · Neutral · arXiv – CS AI · May 27 · 7/10
🧠Researchers propose SALO, a jailbreak detection method that identifies persistent 'refusal trajectories' across model layers, rather than relying on static terminal representations. The detector demonstrates improved detection rates against adversarial attacks on multiple LLM architectures, though with acknowledged limitations against adaptive attacks.
🧠 Llama
AI · Neutral · arXiv – CS AI · May 9 · 7/10
🧠Researchers have developed TurnGate, a defense system that detects multi-turn dialogue attacks where malicious intent is distributed across multiple conversation turns rather than exposed in a single prompt. The study introduces the Multi-Turn Intent Dataset (MTID) and demonstrates that the system outperforms existing baselines while maintaining low false-positive refusal rates.
AI · Neutral · arXiv – CS AI · Jun 8 · 6/10
🧠Researchers demonstrate that diffusion language models are more robust to jailbreaks than autoregressive models because their sampling mechanisms can recover from harmful intermediate generations. They introduce a Step-Wise Refusal Internal Dynamics (SRI) signal that enables effective jailbreak detection without modifying inference and that generalizes to unseen attacks.
AI · Neutral · arXiv – CS AI · Jun 5 · 6/10
🧠GuardNet, an ensemble-based detection system using shallow neural networks, demonstrates competitive performance in identifying prompt injection and jailbreak attacks on large language models while operating at 50 ms latency, which is suitable for production deployment. Although larger LLMs outperform it on some benchmarks, GuardNet achieves strong results (0.747 AUROC) with significantly lower computational overhead, challenging the assumption that adversarial robustness requires massive model scale.
🧠 Llama
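To make the two ideas in the GuardNet summary concrete, here is a minimal sketch of soft-voting over several small detectors and of how an AUROC figure such as 0.747 is computed. It is not GuardNet's implementation: the toy `keyword_detector`, `ensemble_score`, and `auroc` helpers are hypothetical names invented for illustration, and the keyword scorers merely stand in for trained shallow networks.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical stand-in for a shallow detector: maps a prompt to a score in [0, 1].
Detector = Callable[[str], float]


def keyword_detector(keywords: Sequence[str]) -> Detector:
    """Toy detector (assumption): fraction of listed keywords found in the prompt."""
    def score(prompt: str) -> float:
        text = prompt.lower()
        return sum(k in text for k in keywords) / len(keywords)
    return score


def ensemble_score(detectors: Sequence[Detector], prompt: str) -> float:
    """Soft voting: average the scores of all ensemble members."""
    return mean(d(prompt) for d in detectors)


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random attack (label 1) outscores a random benign
    prompt (label 0); ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    members = [
        keyword_detector(["ignore", "instructions"]),
        keyword_detector(["jailbreak"]),
    ]
    prompts = ["Ignore previous instructions", "What is the weather today?"]
    labels = [1, 0]
    scores = [ensemble_score(members, p) for p in prompts]
    print(scores, auroc(scores, labels))
```

An AUROC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so a reported 0.747 means an attack prompt is ranked above a benign one roughly three times out of four.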
AI · Neutral · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce Temporal Logit Observability (TLO), a training-free diagnostic tool that analyzes logit patterns during decoding to reveal how LLM jailbreak attacks unfold over time, not just whether they succeed. The method shows that attacks with identical success rates can follow different failure pathways, enabling better safety evaluation and early-stopping defenses that reduce successful jailbreaks by over 50%.
AI · Bullish · arXiv – CS AI · May 29 · 6/10
🧠Researchers introduce Opir, a family of efficient encoder-based safety classification models designed to detect toxic content, jailbreaks, and harmful prompts in LLM applications without requiring expensive large guardrail models. The models achieve competitive performance across 12 safety tasks against eight contemporary systems while maintaining significantly smaller deployment footprints, with edge variants containing fewer than 100M parameters.