#vlm-safety News & Analysis

4 articles tagged with #vlm-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Detect Before You Leap: Mirage Detection in Vision-Language Models

Researchers have developed TC-LIA, a model-agnostic detection method that identifies when Vision-Language Models (VLMs) produce confident but visually ungrounded answers, a failure mode called 'mirage.' The technique detects these hallucinations with 94.6–94.7% accuracy across multiple VLM architectures and reduces mirage rates from 21.7–66.6% to below 3%. The result matters for medical and document-based AI systems, where false confidence poses safety risks.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Sparse Autoencoders as Plug-and-Play Firewalls for Adversarial Attack Detection in VLMs

Researchers propose SAEgis, a lightweight framework that uses sparse autoencoders (SAEs) to detect adversarial perturbations aimed at vision-language models. The plug-and-play method requires no additional adversarial training and demonstrates strong cross-domain generalization, addressing a critical safety gap as VLM systems become more widely deployed.

AI · Bearish · arXiv – CS AI · May 9 · 7/10

How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

Researchers present ImmersedPrivacy, an evaluation framework that tests Vision-Language Models' ability to recognize and respect privacy in physical environments. Tests of 12 state-of-the-art VLMs reveal significant deficiencies: all models struggle with cluttered scenes, none exceeds 65% accuracy when social context changes, and even the best model balances task completion with privacy preservation only 51% of the time.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Harnessing Hyperbolic Geometry for Harmful Prompt Detection and Sanitization

Researchers propose HyPE and HyPS, a two-part defense framework that uses hyperbolic geometry to detect and neutralize harmful prompts in Vision-Language Models. The approach offers a lightweight, interpretable alternative to blacklist filters and classifier-based systems, both of which are vulnerable to adversarial attacks.