
FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection

arXiv – CS AI | Paramananda Bhaskar, Naquee Rizwan, Daksh Jogchand, Saurabh Kumar Pandey, Animesh Mukherjee | June 1, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce FBHM, a systematically curated benchmark for evaluating vision-language models (VLMs) on hateful meme detection across 25 rhetorical functionalities and 10 target communities. The study finds that state-of-the-art VLMs generalize poorly: they drop from high accuracy on standard datasets to near-random performance on FBHM, which indicates reliance on dataset-specific shortcuts rather than robust multimodal reasoning. The proposed LSV (learnable steering vectors) method improves Macro-F1 by roughly 30 points using minimal training data, without degrading source-domain performance.

Analysis

This research addresses a critical vulnerability in vision-language models tasked with content moderation at scale. The FBHM benchmark fundamentally challenges the assumption that high performance on existing hateful meme datasets translates to reliable real-world detection capabilities. By isolating rhetorical mechanisms from target community features through systematic orthogonal design, researchers expose how contemporary VLMs exploit spurious correlations rather than learning generalizable multimodal reasoning patterns.
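The orthogonal design can be pictured as a fully crossed grid of functionality and community. The sketch below is only an illustration under stated assumptions: the summary gives the counts (25 and 10) but not the labels, so the names are placeholders, the full cross is assumed, and the scores are synthetic toy values rather than results from the paper.

```python
from collections import defaultdict
from itertools import product

# Placeholder labels (assumption): the source only states 25 functionalities
# and 10 target communities, not what they are called.
FUNCTIONALITIES = [f"functionality_{i:02d}" for i in range(1, 26)]
COMMUNITIES = [f"community_{j:02d}" for j in range(1, 11)]

# Assumed full cross: every functionality paired with every community.
GRID = list(product(FUNCTIONALITIES, COMMUNITIES))


def marginal_means(cell_scores):
    """Average per-cell scores along each axis separately."""
    by_func, by_comm = defaultdict(list), defaultdict(list)
    for (func, comm), score in cell_scores.items():
        by_func[func].append(score)
        by_comm[comm].append(score)
    func_means = {f: sum(v) / len(v) for f, v in by_func.items()}
    comm_means = {c: sum(v) / len(v) for c, v in by_comm.items()}
    return func_means, comm_means


# Synthetic toy scores (NOT from the paper): one rhetorical mechanism fails
# for every community, everything else scores 0.8.
TOY_SCORES = {
    (func, comm): 0.3 if func == "functionality_03" else 0.8
    for func, comm in GRID
}

FUNC_MEANS, COMM_MEANS = marginal_means(TOY_SCORES)
print(len(GRID), "cells")
print("weakest functionality:", min(FUNC_MEANS, key=FUNC_MEANS.get))
```

Because every community appears under every functionality, the toy failure shows up as one low row in the functionality averages while the community averages stay uniform. In a dataset where a mechanism only ever co-occurs with one community, the two explanations could not be told apart.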

The generalization gap revealed here reflects a broader pattern in AI safety research: benchmark-driven optimization can create illusions of progress while masking fundamental model limitations. This matters particularly for content moderation systems deployed by social platforms, where catastrophic performance drops on unseen hate speech variants represent significant risks. The confounding of rhetorical hate mechanisms with community features in previous benchmarks created blind spots that production systems inherited.

The LSV intervention method offers practical value by demonstrating that causal steering with minimal data can substantially improve model robustness. That only 500 steering samples, derived from 50 base memes, yield a gain of roughly 30 Macro-F1 points suggests that targeted, theoretically grounded steering can outperform broader adaptation strategies such as in-context learning and parameter-efficient fine-tuning (PEFT). This has implications for resource-constrained deployment scenarios where practitioners cannot afford extensive annotation efforts.
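The summary does not describe how LSV is trained, so the following is not the authors' method. It is a toy sketch of the general idea behind a learned steering vector: the model's weights stay frozen and only a single additive vector on the hidden state is optimized. The frozen linear head, the data, and all names are hypothetical.

```python
import math

# Hypothetical frozen "model": a linear head over a 2-d hidden state.
# Its bias makes it under-predict the positive (hateful) class.
W = [1.0, -1.0]
B = -3.0

# Toy hidden states and labels (1 = hateful, 0 = benign); illustrative only.
HIDDEN = [[2.0, 0.0], [1.5, -0.5], [-1.0, 1.0], [-2.0, 0.5]]
LABELS = [1, 1, 0, 0]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(h, v):
    """Frozen head applied to the steered hidden state h + v."""
    return sum(wi * (hi + vi) for wi, hi, vi in zip(W, h, v)) + B


def mean_loss(v):
    """Mean binary cross-entropy over the toy set for steering vector v."""
    total = 0.0
    for h, y in zip(HIDDEN, LABELS):
        p = min(max(sigmoid(logit(h, v)), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(HIDDEN)


def train_steering_vector(lr=0.5, steps=200):
    """Gradient descent on v only; W and B are never updated."""
    v = [0.0] * len(W)
    for _ in range(steps):
        grad = [0.0] * len(W)
        for h, y in zip(HIDDEN, LABELS):
            err = sigmoid(logit(h, v)) - y  # d(loss)/d(logit)
            for j, wj in enumerate(W):
                grad[j] += err * wj / len(HIDDEN)
        v = [vj - lr * gj for vj, gj in zip(v, grad)]
    return v


def predict(v):
    return [1 if logit(h, v) > 0 else 0 for h in HIDDEN]


ZERO = [0.0, 0.0]
V = train_steering_vector()
print("before:", predict(ZERO), "after:", predict(V))
```

With a linear head, the learned vector amounts to a shift of the decision threshold, which is why this is only a toy. In a real VLM such a vector would typically be added to intermediate activations, where later nonlinear layers make its effect richer. The point of the sketch is the training setup: a handful of labeled examples and a single small trainable tensor.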

The research establishes a precedent for functionality-based benchmark design that other AI safety researchers may adopt. Future work will likely extend this methodology to other sensitive tasks where model vulnerabilities cluster around specific failure modes rather than being spread across diffuse dataset artifacts. The balance between improving detection accuracy and avoiding false-positive content removal remains a central tension in this space.

Key Takeaways

→ State-of-the-art VLMs show severe generalization collapse on FBHM despite high accuracy on standard benchmarks, revealing dependence on dataset-specific shortcuts
→ The benchmark's orthogonal design, which separates rhetorical mechanisms from target communities, enables causal evaluation that was impossible with previous observational datasets
→ LSV steering vectors achieve a ~30-point Macro-F1 improvement with minimal data (500 samples), outperforming in-context learning and PEFT methods
→ Existing hateful meme detection benchmarks confound multiple factors, preventing accurate diagnosis of model vulnerabilities and safety risks
→ Functionality-based benchmark design offers a replicable methodology for evaluating and improving model robustness across AI safety applications
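For readers unfamiliar with the metric quoted above, Macro-F1 is the unweighted mean of the per-class F1 scores, so a model that ignores one class is penalized even when its accuracy looks acceptable. The sketch below is a plain-Python illustration on made-up labels, not data from the paper.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating that class as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes that appear."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up balanced labels (1 = hateful, 0 = benign) and a degenerate
# classifier that always answers "benign".
Y_TRUE = [1, 1, 0, 0]
ALWAYS_BENIGN = [0, 0, 0, 0]

print("accuracy:", accuracy(Y_TRUE, ALWAYS_BENIGN))           # 0.5
print("macro-F1:", round(macro_f1(Y_TRUE, ALWAYS_BENIGN), 3))  # 0.333
```

On this toy set the always-benign classifier reaches 50% accuracy but only about 33 Macro-F1 points on a 0 to 100 scale, which gives a sense of how large a 30-point Macro-F1 change is.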