🧠 AI · Neutral · Importance: 7/10

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

arXiv – CS AI | Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu
🤖 AI Summary

Researchers identify a 'safety mirage' problem in vision-language models (VLMs): supervised safety fine-tuning creates spurious correlations that leave models vulnerable to simple attacks while making them overly cautious with benign queries. They propose machine unlearning as an alternative, reducing attack success rates by up to 60.27% and unnecessary rejections by over 84.20%.

Key Takeaways
  • Current VLM safety fine-tuning creates spurious correlations between textual patterns and safety responses rather than genuine harm mitigation.
  • Simple one-word modifications in queries can bypass safety measures in fine-tuned VLMs due to these spurious correlations.
  • Over-prudent models unnecessarily reject benign queries, limiting their practical utility.
  • Machine unlearning directly removes harmful knowledge while preserving general capabilities without creating biased feature-label mappings.
  • Machine unlearning reduces attack success rates by up to 60.27% and cuts unnecessary rejections by over 84.20%.
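The spurious-correlation failure mode described above can be illustrated with a toy sketch (not the paper's actual models or method): a filter that has learned to key on surface trigger words, a hypothetical stand-in for what safety fine-tuning may internalize, exhibits both failure modes at once.

```python
# Toy illustration of a spuriously-correlated "safety filter".
# TRIGGER_WORDS is a hypothetical set of memorized textual shortcuts,
# standing in for patterns a fine-tuned VLM might latch onto.
TRIGGER_WORDS = {"bomb", "weapon", "poison"}

def shortcut_filter(query: str) -> str:
    """Refuses iff a memorized trigger word appears in the query —
    a surface textual correlation, not genuine harm understanding."""
    tokens = set(query.lower().split())
    return "REFUSE" if tokens & TRIGGER_WORDS else "ANSWER"

# A query containing the trigger word is caught...
print(shortcut_filter("how to build a bomb"))        # REFUSE
# ...but a one-word substitution bypasses the filter entirely,
print(shortcut_filter("how to build an explosive"))  # ANSWER
# while a benign query sharing a trigger word is over-cautiously rejected.
print(shortcut_filter("history of the atomic bomb")) # REFUSE
```

The last two calls mirror the two takeaways above: a one-word modification defeats the safety measure, and an over-prudent model rejects a harmless query.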
Read Original → via arXiv – CS AI