What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models
Researchers introduce HAERAE-Vision, a benchmark of 653 real-world underspecified visual questions from Korean online communities, revealing that state-of-the-art vision-language models achieve under 50% accuracy on natural queries despite performing well on structured benchmarks. The study demonstrates that query clarification alone improves performance by 8-22 points, highlighting a critical gap between current evaluation standards and real-world deployment requirements.
The research exposes a fundamental disconnect between how vision-language models are evaluated in academic settings and how they perform in actual user interactions. Traditional benchmarks rely on well-crafted, explicit prompts that bear little resemblance to genuine user behavior, where people frequently omit context and rely on visual information to communicate intent. By extracting 653 questions from authentic Korean online communities and pairing each with an explicit rewrite, researchers created a dataset that reveals the true challenge facing VLM developers.
This finding carries significant implications for AI model development. When state-of-the-art systems like GPT-4 and Gemini 2.5 Pro fail on roughly half of real queries yet improve substantially with minor clarification, the issue shifts from raw capability to input interpretation. The 8-22 point improvement suggests that many apparent model limitations stem from ambiguous prompts rather than fundamental understanding deficits. Smaller models benefit disproportionately from clarification, indicating that query precision matters most for less sophisticated architectures.
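The paired-query design described above can be sketched as a simple evaluation loop: score each model on both the original underspecified question and its explicit rewrite, then report the difference. This is an illustrative sketch only, not the paper's actual harness; `BenchmarkItem`, `query_model`, and exact-match scoring are all assumptions made for clarity.

```python
# Illustrative sketch of the paired vague/explicit evaluation (names are
# hypothetical, not from the HAERAE-Vision release). Each item carries the
# question as a user wrote it plus a manually clarified rewrite; the gap
# between the two accuracies is the "clarification gain".

from dataclasses import dataclass


@dataclass
class BenchmarkItem:
    image_path: str
    vague_query: str      # underspecified question as users actually wrote it
    explicit_query: str   # manually clarified rewrite
    answer: str           # gold answer


def accuracy(model_fn, items, use_explicit=False):
    """Fraction of items the model answers correctly (exact match, assumed)."""
    correct = 0
    for item in items:
        query = item.explicit_query if use_explicit else item.vague_query
        prediction = model_fn(item.image_path, query)
        correct += prediction.strip().lower() == item.answer.strip().lower()
    return correct / len(items)


def clarification_gain(model_fn, items):
    """Accuracy improvement from explicit rewrites, in percentage points."""
    vague = accuracy(model_fn, items, use_explicit=False)
    explicit = accuracy(model_fn, items, use_explicit=True)
    return 100 * (explicit - vague)
```

Under this framing, the paper's 8-22 point gains correspond to `clarification_gain` values between 8 and 22 across models, with smaller models sitting at the high end of that range.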
The research also demonstrates that retrieval-augmented approaches cannot compensate for underspecified inputs. Web search failed to narrow the performance gap between vague and explicit queries, suggesting that additional data access alone cannot resolve the comprehension problem. This challenges assumptions underlying many production systems that rely on search augmentation to improve accuracy. For developers deploying VLMs in real-world applications, the findings emphasize that user interface design—particularly prompt guidance and clarification mechanisms—may prove as important as model architecture itself for practical performance.
- State-of-the-art VLMs achieve under 50% accuracy on real, unstructured user queries despite high performance on traditional benchmarks
- Query clarification alone yields 8-22 point accuracy improvements, suggesting many apparent model failures result from ambiguous input rather than capability gaps
- Current retrieval-augmented approaches cannot compensate for underspecified queries, limiting the effectiveness of web search augmentation
- Smaller vision-language models benefit most from explicit query reformulation, indicating prompt quality significantly impacts less sophisticated architectures
- The benchmark reveals a critical gap between academic evaluation standards and real-world deployment requirements for practical VLM systems