#multimodal-reasoning News & Analysis

16 articles tagged with #multimodal-reasoning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Researchers introduce MemDreamer, a framework that enables Vision-Language Models to process hours-long videos by decoupling perception from reasoning through hierarchical graph memory and agentic retrieval. The approach achieves state-of-the-art results while reducing computational context requirements to 2% of full video ingestion, establishing a new paradigm for long-form multimodal understanding.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models

Researchers introduce PolarMem, a training-free memory framework that enhances vision-language models by explicitly tracking what has been verified as absent or excluded, not just what is similar. The system uses a polarized graph structure with positive and negative memory relations to reduce logical contradictions and improve reasoning reliability across multiple multimodal benchmarks.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

Researchers introduce Perception-Grounded Policy Optimization (PGPO), a novel fine-tuning framework that improves how large vision-language models learn from visual inputs by strategically allocating learning signals to vision-dependent tokens rather than treating all tokens equally. Testing on the Qwen2.5-VL series demonstrates an average 18.7% performance boost across multimodal reasoning benchmarks.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Asking like Socrates: Socrates helps VLMs understand remote sensing images

Researchers introduce RS-EoT (Remote Sensing Evidence-of-Thought), a novel framework that enables vision-language models to reason more effectively about satellite imagery by iteratively seeking visual evidence rather than relying on linguistic patterns. The approach uses a self-play multi-agent system called SocraticAgent and reinforcement learning to address the 'Glance Effect,' where models superficially analyze large-scale remote sensing images, achieving state-of-the-art performance on multiple benchmarks.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Physics Question Scene Graph: Fine-grained Evaluation of Physical Plausibility in Text-to-Video Generation

Researchers introduce Physics Question Scene Graph (PQSG), a new evaluation framework that uses vision-language models to assess whether AI-generated videos obey physical laws. The framework evaluates videos from models like Sora 2 and Veo 3 through hierarchical question graphs, revealing that closed-source models outperform open-source alternatives in physical realism.

🧠 Sora

AI · Bullish · arXiv – CS AI · Jun 23 · 6/10

Gold Points Sniper: Self-guided Visual Reasoning in VLM for Fine-grained Action Understanding

Researchers introduce Gold Points Sniper (GPS), a framework enhancing lightweight vision-language models with self-guided reasoning for fine-grained human action understanding in robotics. The system combines critical detail extraction, self-questioning validation, and semantic entailment checking to achieve GPT-4o-level performance while maintaining superior factual accuracy for domestic robot applications.

🧠 GPT-4

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

DeALOG: Decentralized Multi-Agents Log-Mediated Reasoning Framework

Researchers introduce DeALOG, a decentralized multi-agent framework that uses specialized AI agents coordinating through a shared natural-language log to answer complex questions spanning text, tables, and images. The system demonstrates competitive performance on multiple benchmarks while improving robustness through collaborative verification without central control.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

MedRLM: Recursive Multimodal Health Intelligence for Long-Context Clinical Reasoning, Sensor-Guided Screening, Evidence-Grounded Decision Support, and Community-to-Tertiary Referral Optimization

MedRLM is a new AI framework designed to improve clinical decision support by recursively analyzing heterogeneous patient data across electronic health records (EHRs), medical images, sensor streams, and clinical guidelines. The system uses specialized agents and an evidence graph memory to coordinate reasoning tasks and trigger deeper analysis when abnormal physiological patterns are detected, moving beyond single-step medical AI systems toward more auditable, workflow-integrated clinical tools.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

Researchers introduce EngVQA, a benchmark for evaluating Vision-Language Models' engineering reasoning capabilities across 696 problems spanning five engineering subjects. The study reveals significant limitations in current VLMs' ability to perform multi-step technical reasoning while maintaining physical consistency, despite their strong performance on general multimodal tasks.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Improving Multimodal Reasoning via Worst Dimension Optimization

Researchers propose a worst dimension optimization approach to improve multimodal reasoning in AI systems. Current Process Reward Models fail to detect individual dimensional failures when dominant factors mask underlying weaknesses, compromising reasoning validity across visual and logical constraints.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Researchers introduce Hilbert-Geo, a neural-symbolic AI framework for solving solid geometry problems by combining formal language representation with theorem-based reasoning. The system achieves 77.3% accuracy on solid geometry tasks, significantly outperforming leading AI models like GPT-4 and Gemini-2.5-pro, demonstrating advances in multimodal geometric reasoning.

🧠 GPT-5 · 🧠 Gemini

AI · Neutral · arXiv – CS AI · May 28 · 6/10

CyberJurors: A Multi-Agent Simulation Task for E-Commerce Disputes Verdict

Researchers introduce CyberJurors, a multi-agent AI framework and VerdictBench dataset designed to automate e-commerce dispute resolution through simulated jury deliberation. The system decomposes dispute analysis into structured reasoning stages and incorporates multi-agent consensus mechanisms to better align with real-world crowdsourced jury decisions.

🏢 Hugging Face

AI · Neutral · arXiv – CS AI · May 28 · 6/10

VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

Researchers introduce VeriTrip, a new benchmark for evaluating travel planning AI agents on their ability to reason over unstructured web data rather than structured APIs. The benchmark addresses critical gaps in agent evaluation by testing performance against information noise, contradictory facts, and multimodal content, revealing a significant trade-off between autonomous information retrieval and instruction following.

AI · Neutral · arXiv – CS AI · May 4 · 6/10

InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information

Researchers introduce InterChart, a benchmark designed to evaluate how well vision-language models (VLMs) reason across multiple related charts, a capability essential for financial analysis, scientific reporting, and policy dashboards. Testing reveals that state-of-the-art VLMs struggle significantly as chart complexity increases and perform better when multi-entity charts are decomposed into simpler components, highlighting a critical gap in multimodal reasoning capabilities.

AI · Neutral · arXiv – CS AI · Apr 13 · 6/10

Visually-Guided Policy Optimization for Multimodal Reasoning

Researchers propose Visually-Guided Policy Optimization (VGPO), a framework that enhances vision-language models' ability to focus on visual information during reasoning tasks. The method addresses a fundamental limitation where text-dominated VLMs suffer from weak visual attention and temporal visual forgetting, improving performance on multimodal reasoning and visual-dependent tasks.

AI · Bearish · arXiv – CS AI · Apr 7 · 6/10

Don't Blink: Evidence Collapse during Multimodal Reasoning

Research reveals that vision-language models (VLMs) progressively lose visual grounding during reasoning tasks, producing dangerous low-entropy predictions that appear confident but lack visual evidence. The study found that attention to visual evidence drops by over 50% during reasoning across multiple benchmarks, a finding that points to the need for task-aware monitoring in safe AI deployment.