#vlm-evaluation News & Analysis

9 articles tagged with #vlm-evaluation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

MedVision: Benchmarking Quantitative Medical Image Analysis

Researchers introduce MedVision, a large-scale benchmark dataset with 30.8 million image-annotation pairs designed to evaluate and improve vision-language models (VLMs) on quantitative medical image analysis tasks. The work demonstrates that current VLMs perform poorly on clinical quantitative reasoning—such as tumor measurement and joint angle assessment—but can be significantly improved through supervised and reinforcement fine-tuning.

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

What Users Leave Unsaid: Under-Specified Queries Limit Vision-Language Models

Researchers introduce HAERAE-Vision, a benchmark of 653 real-world underspecified visual questions from Korean online communities, revealing that state-of-the-art vision-language models achieve under 50% accuracy on natural queries despite performing well on structured benchmarks. The study demonstrates that query clarification alone improves performance by 8-22 points, highlighting a critical gap between current evaluation standards and real-world deployment requirements.

🧠 GPT-5 · 🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

AMVICC: A Novel Benchmark for Cross-Modal Failure Mode Profiling for VLMs and IGMs

Researchers introduce AMVICC, a novel benchmark for evaluating failure modes in vision-language models (VLMs) and image generation models (IGMs). Testing 11 multimodal LLMs and 3 IGMs across 9 visual reasoning categories, the study reveals that both model types struggle with basic visual concepts like object orientation, quantity, and spatial relationships, with some failures shared across modalities and others model-specific.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

Researchers introduce RTSGameBench, a benchmark that uses real-time strategy games to evaluate the strategic reasoning of vision-language models (VLMs). The framework reveals that current state-of-the-art VLMs struggle with coordination, multi-agent scenarios, and complex large-scale tasks, pointing to a gap in their strategic reasoning.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

Researchers introduce EngVQA, a benchmark for evaluating Vision-Language Models' engineering reasoning capabilities across 696 problems spanning five engineering subjects. The study reveals significant limitations in current VLMs' ability to perform multi-step technical reasoning while maintaining physical consistency, despite their strong performance on general multimodal tasks.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

Researchers propose a density ridge-based method for detecting hallucinations in large language and vision-language models that outperforms existing approaches by 5-20 AUROC points while requiring minimal calibration labels. The technique maps hidden state trajectories to a low-dimensional geometric skeleton, enabling robust hallucination detection even when training data is scarce.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Sci-Rho: A Multilingual Visually-Grounded Symbolic Benchmark for STEM Problems

Researchers introduce Sci-Rho, a multilingual benchmark comprising 42,420 visually-grounded STEM problem instances across seven languages designed to test the robustness of vision-language models. The study reveals significant gaps between average and worst-case accuracy, with smaller models showing greater performance degradation across languages while larger proprietary models demonstrate better robustness.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

From Segments to Scenes: Temporal Understanding in Autonomous Driving via Vision-Language Model

Researchers introduce the Temporal Understanding in Autonomous Driving (TAD) benchmark, a dataset of nearly 6,000 QA pairs designed to evaluate vision-language models' ability to understand temporal sequences in driving scenarios. The study reveals that state-of-the-art VLMs significantly underperform on temporal reasoning tasks and proposes two training-free solutions—Scene-CoT and TCogMap—that improve accuracy by up to 17.72% on the benchmark.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Researchers introduce OmniSpatial, a comprehensive benchmark for testing spatial reasoning capabilities in vision-language models (VLMs). The benchmark comprises over 8,400 question-answer pairs spanning four major spatial reasoning categories and reveals significant limitations in both open- and closed-source VLMs.
