y0news

#mllm-limitations News & Analysis

5 articles tagged with #mllm-limitations. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes

Researchers introduced CAIT, a benchmark testing multimodal large language models' ability to understand counter-intuitive visual scenes that contradict common sense. The study reveals that open-source MLLMs fail dramatically at these tasks due to language bias, automatically overriding visual evidence with statistically common text patterns, while proprietary models like Claude and Gemini demonstrate robust performance.

Claude · Gemini
AI · Bearish · arXiv – CS AI · May 12 · 7/10

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

Researchers reveal that multimodal large language models achieve high visual reasoning benchmark scores by exploiting a 'Cartesian Shortcut'—leveraging grid-based layouts that convert to explicit text coordinates rather than performing genuine visual understanding. The Polaris-Bench study shows frontier models collapse from 70-83% accuracy to 31-39% when benchmarks are reformulated in polar coordinate space, exposing critical deficiencies in topology-invariant reasoning.
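
To make the Cartesian-versus-polar distinction concrete, here is a minimal Python sketch that re-expresses a grid cell as a radius and angle about the grid center. It is an illustration only and not the Polaris-Bench procedure: the helper name `grid_to_polar`, the choice of the grid center as origin, and the counter-clockwise angle convention are all assumptions.

```python
import math


def grid_to_polar(row, col, n_rows, n_cols):
    """Map a grid cell to (radius, angle in degrees) about the grid center.

    Hypothetical helper for illustration; the origin and angle convention
    are assumptions, not taken from the paper.
    """
    # Center of the grid, measured in cell units.
    cy = (n_rows - 1) / 2
    cx = (n_cols - 1) / 2
    # Offsets from the center; flip the row axis so that "up" is positive.
    dx = col - cx
    dy = cy - row
    radius = math.hypot(dx, dy)
    # atan2 handles all quadrants; wrap the result into [0, 360).
    angle = math.degrees(math.atan2(dy, dx)) % 360
    return radius, angle
```

In a row/column layout a cell's position reads off directly as a pair of integers, whereas the polar form of the same cell is generally a non-integer radius and angle, which may be one reason a text-coordinate shortcut stops working.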

AI · Neutral · arXiv – CS AI · 18h ago · 6/10

Multimodal Large Language Models as Synthetic Participants in Video-Based Studies: An Evaluation

Researchers evaluated whether multimodal large language models (MLLMs) like Gemini 3 Flash and Qwen 3 Omni can replicate human subjective responses in video perception tasks using the Perceived Message Sensation Value framework. The study found significant limitations: MLLMs demonstrated systematic biases including downward mean-shift, central-tendency bias, and inconsistent sensitivity to participant profiles, suggesting current models remain unreliable as synthetic human participants for subjective research.

Gemini
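
As a rough illustration of the two bias terms in the summary above, the Python sketch below shows one simple way such effects could be quantified when model and human ratings are on the same scale. The function names `mean_shift` and `spread_ratio` are hypothetical, and these are not necessarily the statistics the study used.

```python
from statistics import mean, pstdev


def mean_shift(model_ratings, human_ratings):
    """Difference in average rating (model minus human).

    A negative value corresponds to a downward mean-shift.
    Hypothetical metric for illustration only.
    """
    return mean(model_ratings) - mean(human_ratings)


def spread_ratio(model_ratings, human_ratings):
    """Ratio of model rating spread to human rating spread.

    A value well below 1 suggests ratings clustered near the middle of
    the scale (central-tendency bias). Assumes the human ratings are not
    all identical, otherwise the denominator is zero.
    """
    return pstdev(model_ratings) / pstdev(human_ratings)
```

Under these assumed definitions, a model that rates every video slightly lower than people do would show a negative `mean_shift`, and one that avoids the extremes of the scale would show a `spread_ratio` below 1.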
AI · Neutral · arXiv – CS AI · May 11 · 6/10

R$^3$L: Reasoning 3D Layouts from Relative Spatial Relations

R³L is a framework that improves 3D layout generation by correcting errors in relative spatial reasoning through invariant spatial decomposition and consistent spatial imagination. It targets error accumulation in multi-hop reasoning and produces more physically feasible and semantically consistent layouts than previous methods that rely on multimodal large language models (MLLMs).

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Researchers introduced 'Mind's Eye,' a benchmark that tests multimodal large language models (MLLMs) on visual reasoning tasks inspired by human intelligence tests. The evaluation reveals a significant gap between human performance (80% accuracy) and leading MLLMs (below 50%), exposing limitations in visuospatial reasoning, visual attention, and conceptual abstraction.