AI · Neutral · arXiv – CS AI · 3h ago · 6/10
When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?
Researchers demonstrate that explicit image-tool interaction in vision-language models reduces jailbreak success rates by approximately 30% compared to direct response generation. The protective effect stems from a safety-relevant shift in hidden representations rather than benign image semantics alone, suggesting image-tool invocation is a promising architectural pattern for improving multimodal AI safety.