#adversarial-research News & Analysis

5 articles tagged with #adversarial-research. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 10 · 7/10

BadRobot: Jailbreaking Embodied LLM Agents in the Physical World

Researchers introduce BadRobot, an attack paradigm that uses voice commands to make embodied LLM agents perform harmful physical actions. The study demonstrates successful attacks against prominent frameworks such as VoxPoser and Code as Policies, revealing critical safety gaps in AI systems integrated into physical robots.

AI · Neutral · arXiv – CS AI · May 28 · 7/10

CRaFT: Circuit-Guided Refusal Feature Selection via Cross-Layer Transcoders

Researchers propose CRaFT, a circuit-guided framework that identifies critical refusal features in large language models by analyzing relationships between features rather than isolated activation signals. The method raises the jailbreak attack success rate from 6.7% to 57.4% across benchmarks, advancing understanding of LLM safety mechanisms and highlighting vulnerabilities in model alignment.

AI · Bearish · arXiv – CS AI · May 9 · 7/10

The Illusion of Forgetting: Attack Unlearned Diffusion via Initial Latent Variable Optimization

Researchers demonstrate that current concept erasure (unlearning) methods for text-to-image diffusion models do not truly remove harmful knowledge; they only disrupt the linguistic pathways to it. The authors introduce IVO, an attack framework that exploits this weakness by reconstructing those mappings and reviving the dormant memories, exposing fundamental vulnerabilities in 11 existing unlearning techniques.

AI · Bearish · arXiv – CS AI · Apr 15 · 7/10

Every Picture Tells a Dangerous Story: Memory-Augmented Multi-Agent Jailbreak Attacks on VLMs

Researchers introduce MemJack, a multi-agent framework that exploits semantic vulnerabilities in Vision-Language Models (VLMs) through coordinated jailbreak attacks, achieving a 71.48% attack success rate against Qwen3-VL-Plus. The study finds that current VLM safety measures fail against sophisticated visual-semantic attacks and introduces MemJack-Bench, a dataset of 113,000+ attack trajectories intended to advance defensive research.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Cross-Entropy Games and Frost Training

Researchers introduce Frost Training, a method that applies gradient-based optimization from embedding space to improve LLM policy training on Cross-Entropy Games. The technique draws on signals previously used only in adversarial jailbreaking to speed up training, reaching higher-quality outputs faster in Monte Carlo-based optimization tasks.