#visual-reasoning News & Analysis

30 articles tagged with #visual-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

30 articles

AIBullisharXiv – CS AI · 4d ago7/10

🧠

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

InterSketch introduces a new vision-language model architecture that combines visual sketches with textual reasoning in an interleaved chain-of-thought approach, moving beyond text-centric AI paradigms. The model uses self-correction mechanisms and stepwise reward functions during reinforcement learning to improve performance on complex visual reasoning tasks, reportedly outperforming proprietary models like Gemini-3-Pro.

🧠 Gemini

AIBullisharXiv – CS AI · 4d ago7/10

🧠

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Researchers introduce Athena-PRM, a multimodal process reward model that evaluates reasoning steps in complex problem-solving with remarkable data efficiency, requiring only 5,000 samples. The model leverages prediction consistency between weak and strong AI completers to generate high-quality training labels, achieving state-of-the-art results across multiple benchmarks including WeMath, MathVista, and VisualProcessBench.

AIBearisharXiv – CS AI · May 127/10

🧠

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

Researchers reveal that multimodal large language models achieve high visual reasoning benchmark scores by exploiting a 'Cartesian Shortcut'—leveraging grid-based layouts that convert to explicit text coordinates rather than performing genuine visual understanding. The Polaris-Bench study shows frontier models collapse from 70-83% accuracy to 31-39% when benchmarks are reformulated in polar coordinate space, exposing critical deficiencies in topology-invariant reasoning.

AINeutralarXiv – CS AI · Apr 147/10

🧠

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

Researchers identify a critical failure mode in multimodal AI reasoning models called Reasoning Vision Truth Disconnect (RVTD), where hallucinations occur at high-entropy decision points when models abandon visual grounding. They propose V-STAR, a training framework using hierarchical visual attention rewards and forced reflection mechanisms to anchor reasoning back to visual evidence and reduce hallucinations in long-chain tasks.

AIBullisharXiv – CS AI · Apr 77/10

🧠

V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

Researchers introduce V-Reflection, a new framework that transforms Multimodal Large Language Models (MLLMs) from passive observers to active interrogators through a 'think-then-look' mechanism. The approach addresses perception-related hallucinations in fine-grained tasks by allowing models to dynamically re-examine visual details during reasoning, showing significant improvements across six perception-intensive benchmarks.

AINeutralarXiv – CS AI · Apr 67/10

🧠

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

Researchers propose the Hallucination-as-Cue Framework to analyze reinforcement learning's effectiveness in training multimodal AI models. The study reveals that RL training can improve reasoning performance even under hallucination-inductive conditions, challenging assumptions about how these models learn from visual information.

AIBullisharXiv – CS AI · Apr 67/10

🧠

Training Multi-Image Vision Agents via End2End Reinforcement Learning

Researchers introduce IMAgent, an open-source visual AI agent trained with reinforcement learning to handle multi-image reasoning tasks. The system addresses limitations of current VLM-based agents that only process single images, using specialized tools for visual reflection and verification to maintain attention on image content throughout inference.

🏢 OpenAI🧠 o1🧠 o3

AIBullisharXiv – CS AI · Mar 57/10

🧠

From Narrow to Panoramic Vision: Attention-Guided Cold-Start Reshapes Multimodal Reasoning

Researchers introduce Visual Attention Score (VAS) to analyze multimodal reasoning models, discovering that higher visual attention correlates strongly with better performance (r=0.9616). They propose AVAR framework that achieves 7% performance gains on Qwen2.5-VL-7B across multimodal reasoning benchmarks.

AIBullisharXiv – CS AI · Mar 46/103

🧠

Through the Lens of Contrast: Self-Improving Visual Reasoning in VLMs

Researchers introduce VC-STaR, a new framework that improves visual reasoning in vision-language models by using contrastive image pairs to reduce hallucinations. The approach creates VisCoR-55K, a new dataset that outperforms existing visual reasoning methods when used for model fine-tuning.

AINeutralarXiv – CS AI · Mar 46/103

🧠

ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

Researchers introduce ViPlan, the first benchmark for comparing Vision-Language Model planning approaches, finding that VLM-as-grounder methods excel in visual tasks like Blocksworld while VLM-as-planner methods perform better in household robotics scenarios. The study reveals fundamental limitations in current VLMs' visual reasoning abilities, with Chain-of-Thought prompting showing no consistent benefits.

AINeutralarXiv – CS AI · 4d ago6/10

🧠

Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning

Researchers introduce Doc-CoB, a new framework that improves how AI models understand documents by progressively focusing on relevant layout regions while maintaining global context. The approach combines coarse-to-fine visual reasoning with multimodal large language models and demonstrates significant performance improvements across seven benchmarks.

AIBullisharXiv – CS AI · May 126/10

🧠

Do multimodal models imagine electric sheep?

Researchers demonstrate that large multimodal models develop internal visual representations when solving spatial reasoning tasks, improving puzzle-solving accuracy from 83% to 89% by integrating visual tokens into chain-of-thought reasoning. The findings suggest AI systems spontaneously form world models without explicit visual supervision, with practical applications for enhancing spatial reasoning capabilities.

AINeutralarXiv – CS AI · May 96/10

🧠

Causal Probing for Internal Visual Representations in Multimodal Large Language Models

Researchers developed a causal probing framework to decode how Multimodal Large Language Models internally represent visual concepts, revealing that entities are encoded in localized regions while abstract concepts distribute globally across networks. The findings expose mechanistic drivers of scaling laws and uncover a disconnect between visual perception and reasoning capabilities in MLLMs.

AINeutralarXiv – CS AI · May 46/10

🧠

InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information

Researchers introduce InterChart, a benchmark designed to evaluate how well vision-language models (VLMs) reason across multiple related charts—a capability essential for financial analysis, scientific reporting, and policy dashboards. Testing reveals that state-of-the-art VLMs struggle significantly as chart complexity increases, performing better when multi-entity charts are decomposed into simpler components, highlighting a critical gap in multimodal reasoning capabilities.

AINeutralarXiv – CS AI · Apr 206/10

🧠

ReactBench: A Benchmark for Topological Reasoning in MLLMs on Chemical Reaction Diagrams

Researchers introduce ReactBench, a benchmark that exposes critical limitations in multimodal large language models' ability to reason about complex topological structures in chemical reaction diagrams. Testing 17 MLLMs reveals a 30%+ performance gap between simple anchor-based tasks and sophisticated structural reasoning tasks, indicating that visual reasoning capabilities remain fundamentally constrained despite strong semantic recognition abilities.

AINeutralarXiv – CS AI · Apr 206/10

🧠

Mind's Eye: A Benchmark of Visual Abstraction, Transformation and Composition for Multimodal LLMs

Researchers introduced 'Mind's Eye,' a benchmark that tests multimodal large language models (MLLMs) on visual reasoning tasks inspired by human intelligence tests. The evaluation reveals a significant gap between human performance (80% accuracy) and leading MLLMs (below 50%), exposing limitations in visuospatial reasoning, visual attention, and conceptual abstraction.

AINeutralarXiv – CS AI · Apr 136/10

🧠

3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

Researchers introduce 3D-VCD, an inference-time framework that reduces hallucinations in 3D-LLM embodied agents by contrasting predictions against distorted scene graphs. The method addresses failures specific to 3D spatial reasoning without requiring model retraining, advancing reliability in embodied AI systems.

AIBullisharXiv – CS AI · Apr 136/10

🧠

VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

Researchers introduce VISOR, a new agentic visual retrieval-augmented generation system that improves how AI models reason over multi-page visual documents. By addressing key technical challenges in evidence gathering and context management, VISOR achieves state-of-the-art results on complex visual reasoning tasks.

AINeutralarXiv – CS AI · Mar 176/10

🧠

VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining

Researchers introduce VTC-Bench, a comprehensive benchmark for evaluating multimodal AI models' ability to use visual tools for complex tasks. The benchmark reveals significant limitations in current models, with leading model Gemini-3.0-Pro achieving only 51% accuracy on multi-tool visual reasoning tasks.

🧠 Gemini

AIBullisharXiv – CS AI · Mar 176/10

🧠

Thinking in Uncertainty: Mitigating Hallucinations in MLRMs with Latent Entropy-Aware Decoding

Researchers propose Latent Entropy-Aware Decoding (LEAD), a new method to reduce hallucinations in multimodal large reasoning models by switching between continuous and discrete token embeddings based on entropy states. The technique addresses issues where transition words correlate with high-entropy states that lead to unreliable outputs in visual question answering tasks.

AINeutralarXiv – CS AI · Mar 176/10

🧠

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

Researchers have identified that multimodal large language models (MLLMs) lose visual focus during complex reasoning tasks, with attention becoming scattered across images rather than staying on relevant regions. They propose a training-free Visual Region-Guided Attention (VRGA) framework that improves visual grounding and reasoning accuracy by reweighting attention to question-relevant areas.

AIBullisharXiv – CS AI · Mar 116/10

🧠

RECODE: Reasoning Through Code Generation for Visual Question Answering

Researchers introduce RECODE, a new framework that improves visual reasoning in AI models by converting images into executable code for verification. The system generates multiple candidate programs to reproduce visuals, then selects and refines the most accurate reconstruction, significantly outperforming existing methods on visual reasoning benchmarks.

AINeutralarXiv – CS AI · Mar 96/10

🧠

VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs

Researchers introduced VisioMath, a new benchmark with 1,800 K-12 math problems designed to test Large Multimodal Models' ability to distinguish between visually similar diagrams. The study reveals that current state-of-the-art models struggle with fine-grained visual reasoning, often relying on shallow positional heuristics rather than proper image-text alignment.

AIBullisharXiv – CS AI · Mar 36/108

🧠

AdaFocus: Knowing When and Where to Look for Adaptive Visual Reasoning

AdaFocus is a new training-free framework for adaptive visual reasoning in Multimodal Large Language Models that addresses perceptual redundancy and spatial attention issues. The system uses a two-stage pipeline with confidence-based cropping decisions and semantic-guided localization, achieving 4x faster inference than existing methods while improving accuracy.

AIBullisharXiv – CS AI · Mar 37/108

🧠

VisRef: Visual Refocusing while Thinking Improves Test-Time Scaling in Multi-Modal Large Reasoning Models

Researchers developed VisRef, a new framework that improves visual reasoning in large AI models by re-injecting relevant visual tokens during the reasoning process. The method avoids expensive reinforcement learning fine-tuning while achieving up to 6.4% performance improvements on visual reasoning benchmarks.

Page 1 of 2Next →