🧠 AI | ⚪ Neutral | Importance 6/10

Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks

arXiv – CS AI | Qihua Dong, Kuo Yang, Lin Ju, Handong Zhao, Yitian Zhang, Yizhou Wang, Huimin Zeng, Jianglin Lu, Yun Fu | March 2, 2026 at 05:00 AM

🤖 AI Summary

Researchers introduce Ref-Adv, a new benchmark for testing multimodal large language models' visual reasoning capabilities in referring expression tasks. The benchmark reveals that current MLLMs, despite performing well on standard datasets like RefCOCO, rely heavily on shortcuts and show significant gaps in genuine visual reasoning and grounding abilities.

Key Takeaways

→ The Ref-Adv benchmark exposes weaknesses in current multimodal LLMs that standard referring expression comprehension (REC) benchmarks miss because those benchmarks can be solved with shortcuts.
→ The dataset pairs linguistically complex expressions with hard distractors to rule out easy pattern matching.
→ Models that perform well on RefCOCO, RefCOCO+, and RefCOCOg show marked performance drops on Ref-Adv.
→ Current MLLMs rely on simple cues rather than genuine text understanding and visual reasoning.
→ The research provides a comprehensive failure analysis to guide future development of visual reasoning in MLLMs.
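
To make the comparison between benchmarks concrete, here is a minimal sketch of how accuracy on a referring expression task is commonly scored: a predicted box counts as correct when its intersection-over-union (IoU) with the ground-truth box reaches 0.5. The summary above does not state which metric Ref-Adv uses, so the 0.5 threshold, the `(x1, y1, x2, y2)` box format, and the function names `iou` and `rec_accuracy` are assumptions made for illustration only.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (if any).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def rec_accuracy(preds: List[Box], targets: List[Box], threshold: float = 0.5) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


if __name__ == "__main__":
    # Hypothetical boxes: the first prediction matches exactly, the second barely overlaps.
    predictions = [(0, 0, 2, 2), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 2, 2), (1, 1, 3, 3)]
    print(rec_accuracy(predictions, ground_truth))
```

Running the same scoring function over a model's predictions on two benchmarks and subtracting the results gives the kind of performance drop the takeaways describe. The boxes in the example are made up and do not come from any dataset.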