#adversarial-detection News & Analysis

5 articles tagged with #adversarial-detection. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition

KYA (Know Your Agents) is an open-source trust and governance framework for autonomous systems that enables verifiable authorization, policy compliance, and post-hoc auditability across multi-agent environments. The system demonstrates strong security performance, detecting 89% of adversarial attacks while maintaining sub-millisecond latency and supporting 15+ agent frameworks.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

Researchers propose SAEgis, a lightweight adversarial attack detection framework using sparse autoencoders (SAEs) to protect vision-language models from adversarial perturbations. The plug-and-play method requires no additional adversarial training and demonstrates strong cross-domain generalization, addressing a critical safety gap in increasingly deployed VLM systems.

AI · Neutral · arXiv – CS AI · May 1 · 7/10

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

Researchers demonstrate that multi-turn prompt injection attacks leave detectable signatures in language model activation patterns, achieving 93.8% detection accuracy through analysis of residual stream trajectories. The approach reveals that adversarial attack sequences exhibit distinctive 'restlessness' patterns across model architectures, though detection effectiveness varies significantly when deployed on real-world data.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

TRACE: Trajectory Reasoning through Adaptive Cross-Step Evidence Aggregation for LLM Agents

Researchers introduce TRACE, a monitoring framework that detects malicious behavior in autonomous LLM agents by aggregating evidence across long sequences of seemingly benign actions. The system achieves an F1 score of 0.713 and a recall of 0.844 on benchmark tests, addressing a critical security gap in which agents can pursue hidden objectives through temporally distributed steps.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 6

What Helps -- and What Hurts: Bidirectional Explanations for Vision Transformers

Researchers propose BiCAM, a new method for interpreting Vision Transformer (ViT) decisions that captures both positive and negative contributions to predictions. The approach improves explanation quality and enables adversarial example detection across multiple ViT variants without requiring model retraining.