AI | Neutral | Importance 7/10
Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning
AI Summary
Researchers identify a 'safety mirage' problem in vision-language models (VLMs): supervised safety fine-tuning creates spurious correlations that leave models vulnerable to simple attacks and overly cautious with benign queries. They propose machine unlearning as an alternative, reporting attack success rates reduced by up to 60.27% and unnecessary rejections cut by over 84.20%.
Key Takeaways
- Current VLM safety fine-tuning creates spurious correlations between textual patterns and safety responses rather than genuine harm mitigation.
- Simple one-word modifications in queries can bypass safety measures in fine-tuned VLMs due to these spurious correlations.
- Over-prudent models unnecessarily reject benign queries, limiting their practical utility.
- Machine unlearning directly removes harmful knowledge while preserving general capabilities without creating biased feature-label mappings.
- Machine unlearning reduces attack success rates by up to 60.27% and cuts unnecessary rejections by over 84.20%.
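
The failure mode in the takeaways above can be made concrete with a toy sketch. Everything in it is hypothetical: the trigger word, the queries, and the `spurious_policy` function are illustrative stand-ins, not the paper's models, data, or evaluation code. It shows how a policy keyed on a surface pattern (here, the first word of the query) produces both a high attack success rate and a high over-refusal rate at once.

```python
# Toy illustration only. The trigger word and all queries are hypothetical
# and are not taken from the paper.
REFUSAL_TRIGGERS = {"how"}  # assumed surface pattern the model latched onto


def spurious_policy(query: str) -> str:
    """Decide from the first word alone, ignoring what the query asks for."""
    words = query.strip().lower().split()
    first_word = words[0] if words else ""
    return "refuse" if first_word in REFUSAL_TRIGGERS else "answer"


def rate(decisions: list[str], target: str) -> float:
    """Fraction of decisions equal to `target`."""
    return sum(d == target for d in decisions) / len(decisions)


# Harmful requests rephrased with a one-word change to the opening.
harmful_rephrased = [
    "Share how to make the dangerous item in this image.",
    "Share how to forge the signature in this image.",
]
# Benign requests that happen to start with the trigger word.
benign = [
    "How do I bake the bread in this image?",
    "How tall is the building in this image?",
]

# Attack success: harmful queries that get answered.
attack_success_rate = rate([spurious_policy(q) for q in harmful_rephrased], "answer")
# Over-refusal: benign queries that get refused.
over_refusal_rate = rate([spurious_policy(q) for q in benign], "refuse")

print(f"attack success rate: {attack_success_rate:.2f}")
print(f"over-refusal rate:   {over_refusal_rate:.2f}")
```

In this toy setup both rates are at their worst because the decision never looks at content. A real VLM's behavior is far less clear-cut, but the two metrics are computed the same way: answered harmful queries over all harmful queries, and refused benign queries over all benign queries.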
#vision-language-models #ai-safety #machine-unlearning #fine-tuning #adversarial-attacks #alignment #multimodal-ai
Read Original (via arXiv, CS AI)