#visual-reasoning News & Analysis

55 articles tagged with #visual-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding

Researchers introduce CoVER, a new framework for Video Large Language Models that improves long-video understanding by gathering multiple search queries for visual evidence and using answer-specific visual feedback for verification. The approach demonstrates superior performance compared to similarly-sized models and some closed-source alternatives.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

PhysScene: A Scene Graph Dataset for Scientific Visual Reasoning in Physics Experiments

Researchers introduce PhysScene, the first scene graph dataset specifically designed for physics experiments, enabling AI systems to understand complex scientific setups through structured visual reasoning. The dataset prioritizes semantic accuracy and relational density over scale, addressing a gap in domain-specific AI training data for scientific applications.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

Researchers introduce MGSD, a self-distillation framework that improves vision-language models' ability to perform visual spatial planning by using symbolic state data during training to bridge the perception-reasoning gap. The approach achieves 18-19% performance improvements on visual planning benchmarks while maintaining purely visual inference.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

Researchers introduce ViCuR, a visual-grounded distillation framework that improves multimodal AI reasoning by using recoverable visual cues instead of answer-dependent privileges. The approach achieves consistent performance gains across seven benchmarks with Qwen3-VL models by eliminating train-test mismatches that encourage shortcut learning rather than genuine visual understanding.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Researchers introduced VAMPS, a benchmark dataset of 1,168 mathematical problems designed to test whether multimodal AI models can effectively use visualization tools to solve complex algebra and calculus problems. Surprisingly, the study found that direct analytical solving consistently outperformed graph-assisted approaches across multiple models, even when visualization should theoretically help.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

Researchers introduce NoRA, a visual reasoning benchmark that evaluates whether AI models can generate and justify appropriate actions in first-person video scenarios through explicit reasoning graphs. The benchmark reveals that current multimodal language models struggle to construct complete action spaces and properly ground decisions in visible evidence, highlighting a critical gap between selecting plausible actions and explaining them through verifiable reasoning.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Diversity Over Frequency: Rethinking Tool Use in Visual Chain-of-Thought Agents

Researchers discover that visual reasoning agents exhibit a 'tool-use collapse' phenomenon where models progressively abandon external visual tools while maintaining or improving task accuracy. By introducing entropy regularization to encourage diverse exploration rather than optimizing tool frequency, the team achieves superior performance on complex tasks like 3D spatial reasoning and medical visual question answering, suggesting diversity matters more than tool usage frequency.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Diagnosing Failure Modes of Shared-State Collaboration in Resource-Constrained Visual Agents

Researchers introduce CoSee, an auditing framework for analyzing failure modes in collaborative visual reasoning systems using resource-constrained language models (4B-8B parameters). The study reveals that shared working memory architectures paradoxically amplify hallucinations rather than improve performance, identifying two critical failure modes: noise reinforcement and policy collapse.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

Researchers propose EAGLE, a framework that improves multi-agent vision-language model collaboration by requiring agents to align on visual evidence from images, not just final answers. The training-free approach demonstrates superior performance across six VQA benchmarks while maintaining interpretability and practical deployment capabilities.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

Researchers introduce ROVER, a lightweight plugin that enhances multimodal large language models' ability to reason across multiple images by intelligently routing visual evidence to specific objects. The approach achieves significant performance improvements on grounded reasoning benchmarks while reducing computational overhead compared to existing methods.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning

Researchers introduce Doc-CoB, a new framework that improves how AI models understand documents by progressively focusing on relevant layout regions while maintaining global context. The approach combines coarse-to-fine visual reasoning with multimodal large language models and demonstrates significant performance improvements across seven benchmarks.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Do multimodal models imagine electric sheep?

Researchers demonstrate that large multimodal models develop internal visual representations when solving spatial reasoning tasks, improving puzzle-solving accuracy from 83% to 89% by integrating visual tokens into chain-of-thought reasoning. The findings suggest AI systems spontaneously form world models without explicit visual supervision, with practical applications for enhancing spatial reasoning capabilities.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

Causal Probing for Internal Visual Representations in Multimodal Large Language Models

Researchers developed a causal probing framework to decode how Multimodal Large Language Models internally represent visual concepts, revealing that entities are encoded in localized regions while abstract concepts distribute globally across networks. The findings expose mechanistic drivers of scaling laws and uncover a disconnect between visual perception and reasoning capabilities in MLLMs.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information

Researchers introduce InterChart, a benchmark designed to evaluate how well vision-language models (VLMs) reason across multiple related charts—a capability essential for financial analysis, scientific reporting, and policy dashboards. Testing reveals that state-of-the-art VLMs struggle significantly as chart complexity increases, performing better when multi-entity charts are decomposed into simpler components, highlighting a critical gap in multimodal reasoning capabilities.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams

Researchers introduce ReactBench, a benchmark that exposes critical limitations in multimodal large language models' ability to reason about complex topological structures in chemical reaction diagrams. Testing 17 MLLMs reveals a 30%+ performance gap between simple anchor-based tasks and sophisticated structural reasoning tasks, indicating that visual reasoning capabilities remain fundamentally constrained despite strong semantic recognition abilities.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Researchers introduced 'Mind's Eye,' a benchmark that tests multimodal large language models (MLLMs) on visual reasoning tasks inspired by human intelligence tests. The evaluation reveals a significant gap between human performance (80% accuracy) and leading MLLMs (below 50%), exposing limitations in visuospatial reasoning, visual attention, and conceptual abstraction.

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

Researchers introduce 3D-VCD, an inference-time framework that reduces hallucinations in 3D-LLM embodied agents by contrasting predictions against distorted scene graphs. The method addresses failures specific to 3D spatial reasoning without requiring model retraining, advancing reliability in embodied AI systems.

AI · Bullish · arXiv – CS AI · Apr 13 · 6/10

VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

Researchers introduce VISOR, a new agentic visual retrieval-augmented generation system that improves how AI models reason over multi-page visual documents. By addressing key technical challenges in evidence gathering and context management, VISOR achieves state-of-the-art results on complex visual reasoning tasks.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Researchers introduce VTC-Bench, a comprehensive benchmark for evaluating multimodal AI models' ability to use visual tools for complex tasks. The benchmark reveals significant limitations in current models: the leading model, Gemini-3.0-Pro, achieves only 51% accuracy on multi-tool visual reasoning tasks.

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

Researchers propose Latent Entropy-Aware Decoding (LEAD), a new method to reduce hallucinations in multimodal large reasoning models by switching between continuous and discrete token embeddings based on entropy states. The technique addresses issues where transition words correlate with high-entropy states that lead to unreliable outputs in visual question answering tasks.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

Researchers have identified that multimodal large language models (MLLMs) lose visual focus during complex reasoning tasks, with attention becoming scattered across images rather than staying on relevant regions. They propose a training-free Visual Region-Guided Attention (VRGA) framework that improves visual grounding and reasoning accuracy by reweighting attention to question-relevant areas.

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

RECODE: Reasoning Through Code Generation for Visual Question Answering

Researchers introduce RECODE, a new framework that improves visual reasoning in AI models by converting images into executable code for verification. The system generates multiple candidate programs to reproduce visuals, then selects and refines the most accurate reconstruction, significantly outperforming existing methods on visual reasoning benchmarks.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs

Researchers introduced VisioMath, a new benchmark with 1,800 K-12 math problems designed to test Large Multimodal Models' ability to distinguish between visually similar diagrams. The study reveals that current state-of-the-art models struggle with fine-grained visual reasoning, often relying on shallow positional heuristics rather than proper image-text alignment.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 8

AdaFocus: Knowing When and Where to Look for Adaptive Visual Reasoning

AdaFocus is a new training-free framework for adaptive visual reasoning in Multimodal Large Language Models that addresses perceptual redundancy and spatial attention issues. The system uses a two-stage pipeline with confidence-based cropping decisions and semantic-guided localization, achieving 4x faster inference than existing methods while improving accuracy.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 8

VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models

Researchers developed VisRef, a new framework that improves visual reasoning in large AI models by re-injecting relevant visual tokens during the reasoning process. The method avoids expensive reinforcement learning fine-tuning while achieving up to 6.4% performance improvements on visual reasoning benchmarks.
