#multimodal-llms News & Analysis

18 articles tagged with #multimodal-llms. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

AIR: Adaptive Interleaved Reasoning with Code in MLLMs

Researchers propose AIR, a framework enhancing multimodal large language models (MLLMs) with adaptive reasoning capabilities through interleaved code execution and reinforcement learning. The approach addresses limitations in existing vision-focused tools by enabling models to handle complex numerical computations, with reported gains of 6.1 percentage points and tool-use success rates above 95%.

🏢 OpenAI · 🧠 o1 · 🧠 o3

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs

Researchers have mapped how Audio-Visual Large Language Models (AVLLMs) process and integrate audio and visual information internally, revealing distinct information flow patterns depending on input configuration. The study demonstrates that multimodal tokens can be pruned after information transfer with minimal performance impact, enabling more efficient inference across different model scales.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

SPACE: Source-free Proxy Anchor Concept Erasure for MLLMs

Researchers introduce SPACE, a source-free machine unlearning framework for multimodal large language models that removes sensitive data without access to original training data. The two-stage approach uses text-guided proxy anchors and dual-constraint semantic isolation to erase target concepts while maintaining model performance, addressing growing privacy and regulatory compliance needs.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

Researchers introduce CardioLens, a rigorous evaluation framework revealing that state-of-the-art multimodal large language models (MLLMs) perform poorly at clinical cardiac MRI interpretation despite strong public benchmark results. The study demonstrates a significant gap between theoretical capabilities and real-world clinical applicability, with models failing to integrate distributed evidence across imaging sequences and temporal phases.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning

Researchers introduce Self-Signals Driven Multi-LLM Debate (SID), a method that leverages internal model signals like token logits and attention mechanisms to improve multi-agent LLM reasoning while reducing computational overhead. The approach enables high-confidence models to exit early and compresses redundant debate content, achieving better accuracy with lower token consumption than existing multi-LLM debate techniques.

AI · Bearish · arXiv – CS AI · May 12 · 7/10

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

Researchers reveal that multimodal large language models achieve high visual reasoning benchmark scores by exploiting a 'Cartesian Shortcut'—leveraging grid-based layouts that convert to explicit text coordinates rather than performing genuine visual understanding. The Polaris-Bench study shows frontier models collapse from 70-83% accuracy to 31-39% when benchmarks are reformulated in polar coordinate space, exposing critical deficiencies in topology-invariant reasoning.

AI · Bearish · arXiv – CS AI · May 7 · 7/10

Are Multimodal LLMs Ready for Clinical Dermatology? A Real-World Evaluation in Dermatology

A comprehensive study evaluating five multimodal large language models (MLLMs) on real-world dermatology tasks reveals a significant gap between benchmark performance and clinical applicability. While models achieved up to 42% accuracy on public datasets, performance dropped dramatically to 1.5-24.65% on actual hospital cases, highlighting critical limitations in deploying these systems for clinical decision-making.

🧠 GPT-4

AI · Bearish · arXiv – CS AI · Apr 20 · 7/10

Chain-of-Thought Degrades Visual Spatial Reasoning Capabilities of Multimodal LLMs

Researchers found that Chain-of-Thought prompting, a technique that improves logical reasoning in multimodal AI models, actually degrades performance on visual spatial tasks. The study evaluated seventeen models across thirteen benchmarks and discovered these systems suffer from shortcut learning, hallucinating visual details from text even when images are absent, indicating a fundamental limitation in current AI reasoning paradigms.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

Reason, Then Re-reason: Cross-view Revisiting Improves Spatial Reasoning

Researchers propose ReRe, a training-free framework that improves spatial reasoning in egocentric videos by having multimodal AI models first form a hypothesis, then revise it using synthesized novel viewpoints. The approach demonstrates significant performance gains on spatial reasoning benchmarks without modifying existing model architectures.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models

Researchers have developed a systematic framework for conditioning Multimodal Large Language Models (MLLMs) with explicit personality traits, revealing that while personality induction improves certain tasks like image captioning, it can degrade performance on reasoning-heavy tasks like visual question answering. The study demonstrates that model behavior is dynamically modulated by both previous and current personality constraints, exposing fundamental challenges in personality modeling for multimodal AI systems.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Multimodal Large Language Models as Synthetic Participants in Video-Based Studies: An Evaluation

Researchers evaluated whether multimodal large language models (MLLMs) like Gemini 3 Flash and Qwen 3 Omni can replicate human subjective responses in video perception tasks using the Perceived Message Sensation Value framework. The study found significant limitations: MLLMs demonstrated systematic biases including downward mean-shift, central-tendency bias, and inconsistent sensitivity to participant profiles, suggesting current models remain unreliable as synthetic human participants for subjective research.

🧠 Gemini

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

Researchers analyzed how multimodal large language models (MLLMs) perform in repeated reference games compared to humans, finding that while agents align on vocabulary labels, they lack true partner-specific conventions. Using a novel constrained pseudo-dyad baseline, they discovered agents succeed through verbose descriptions rather than the compressed, history-dependent expressions humans develop through entrainment.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

A comprehensive review paper presents a unified framework for analyzing video understanding systems powered by multimodal large language models (MLLMs), organizing capabilities into three functional abilities: watching (perception), remembering (memory), and reasoning (inference). The work identifies key challenges in processing long, sparse, and knowledge-intensive video content while operating under computational constraints.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs

Researchers introduced BilliardPhys-Bench, a benchmark that tests multimodal AI models' ability to predict physical interactions in billiards simulations. The evaluation reveals that leading LLMs from OpenAI, Anthropic, Google, and Alibaba struggle with dynamic physics reasoning, exhibiting systematic failures including a 'stasis bias' where models default to predicting no interaction when physical outcomes become difficult to infer.

🧠 Claude · 🧠 Gemini

AI · Neutral · arXiv – CS AI · May 28 · 6/10

SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

Researchers introduce SONIC-O1, a comprehensive benchmark for evaluating multimodal large language models on audio-video understanding tasks. The study reveals significant performance gaps between closed-source and open-source models, particularly in temporal localization, and identifies demographic disparities in model behavior across 60 hours of real-world conversational data.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation

Researchers introduce Vision-OPD, a self-distillation framework that improves multimodal large language models' ability to detect fine-grained visual details by training full-image models to match the performance of crop-focused models. The technique achieves competitive results against larger models without requiring external teachers, labels, or inference-time tools, addressing a critical weakness in current MLLMs.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams

Researchers introduce ReactBench, a benchmark that exposes critical limitations in multimodal large language models' ability to reason about complex topological structures in chemical reaction diagrams. Testing 17 MLLMs reveals a 30%+ performance gap between simple anchor-based tasks and sophisticated structural reasoning tasks, indicating that visual reasoning capabilities remain fundamentally constrained despite strong semantic recognition abilities.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

Researchers introduced MMR-AD, a large-scale multimodal dataset designed to benchmark general anomaly detection using Multimodal Large Language Models (MLLMs). The study reveals that current state-of-the-art MLLMs fall short of industrial requirements for anomaly detection, though a proposed baseline model called Anomaly-R1 demonstrates significant improvements through reasoning-based approaches enhanced by reinforcement learning.