Benchmarking Deflection and Hallucination in Large Vision-Language Models
Researchers introduce VLM-DeflectionBench, a new benchmark with 2,775 samples designed to evaluate how large vision-language models handle conflicting or insufficient evidence. The study reveals that most state-of-the-art LVLMs fail to appropriately deflect when faced with noisy or misleading information, highlighting critical gaps in model reliability for knowledge-intensive tasks.
This research addresses a fundamental blind spot in how vision-language models are currently evaluated. While benchmarks typically measure what models know, this work probes what models do when they don't know—a distinction with serious implications for real-world deployment. The VLM-DeflectionBench dataset tackles a genuinely difficult problem: distinguishing between parametric knowledge (what's stored in model weights) and retrieval-augmented capability, while testing behavior under adversarial conditions.
The motivation stems from a real weakness in existing benchmarks. As training datasets grow larger and models become more capable, many supposedly retrieval-dependent questions can be answered from memory alone, rendering benchmarks obsolete. More critically, when visual and textual evidence conflict or when retrieved information is incomplete, safe models should decline to answer rather than hallucinate. The authors' dynamic curation pipeline addresses obsolescence by filtering for genuinely retrieval-dependent samples over time.
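The curation idea described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' pipeline: every function name, field name, and the closed-book comparison below are assumptions about how such a filter might work. The core move is to ask the model each question *without* its retrieved evidence; if it still answers correctly, the sample is answerable from parametric memory alone and is dropped.

```python
# Hypothetical sketch of dynamic curation: keep only samples a model cannot
# answer from memory alone. `model_answer` is an assumed callable taking a
# question and optional context; `sample` fields are illustrative.

def is_retrieval_dependent(model_answer, sample):
    """Ask closed-book (no retrieved context). If the model still answers
    correctly, the sample is memorized and no longer tests retrieval."""
    closed_book = model_answer(sample["question"], context=None)
    return closed_book.strip().lower() != sample["gold"].strip().lower()

def curate(model_answer, samples):
    """Filter a candidate pool down to genuinely retrieval-dependent items."""
    return [s for s in samples if is_retrieval_dependent(model_answer, s)]
```

Re-running this filter periodically against newer models is what keeps such a benchmark from going stale: items that drift into parametric memory are retired while still-hard items remain.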
The findings carry weight for AI safety and deployment. Across 20 state-of-the-art models, consistent failures to deflect suggest systemic issues in training objectives or fine-tuning approaches. This matters for applications like medical imaging analysis, legal document review, or any knowledge-intensive domain where false confidence poses real risks. The benchmark's fine-grained protocol—disentangling memorization from robustness—provides precision lacking in earlier evaluation frameworks.
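A minimal version of the deflection metric implied here can be written down directly. This is an assumed scoring scheme, not the paper's protocol: the `[ABSTAIN]` marker and the binary unanswerability labels are illustrative. The idea is simply that on samples where evidence is conflicting or insufficient, any non-abstaining answer counts as a hallucination.

```python
def deflection_metrics(predictions, unanswerable):
    """Score model outputs on samples flagged as unanswerable.

    predictions: list of model output strings.
    unanswerable: parallel list of bools, True where the safe behavior
                  is to decline rather than answer.
    """
    ABSTAIN = "[ABSTAIN]"  # illustrative abstain marker, not a real API token
    flagged = [p for p, bad in zip(predictions, unanswerable) if bad]
    if not flagged:
        return {"deflection_rate": None, "hallucination_rate": None}
    deflections = sum(p == ABSTAIN for p in flagged)
    rate = deflections / len(flagged)
    # On flagged samples, every non-abstention is treated as a hallucination.
    return {"deflection_rate": rate, "hallucination_rate": 1.0 - rate}
```

A real protocol would need a softer abstention detector (models rarely emit a single clean token) and would cross it with correctness on answerable samples, which is what disentangles memorization from robustness.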
The resource is likely to shape how researchers train and evaluate multimodal systems, with future work expected to focus on deflection-aware training techniques and mechanisms that improve uncertainty calibration when retrieval yields conflicting signals.
- Large vision-language models consistently fail to decline answering when evidence is conflicting or insufficient, preferring hallucination over appropriate deflection.
- Existing benchmarks become obsolete as growing training data allows models to memorize answers rather than rely on retrieval, requiring dynamic curation strategies.
- The new VLM-DeflectionBench dataset of 2,775 samples provides a reusable framework for evaluating retrieval robustness and distinguishing parametric knowledge from genuine multimodal understanding.
- Fine-grained evaluation across four scenarios reveals that model reliability depends not just on what models know but critically on safe behavior under uncertainty and noisy evidence.
- Results highlight systemic training gaps that could pose safety risks in knowledge-intensive applications like medical imaging, legal analysis, and scientific research.