y0news
#multimodal-ai · 13 articles
AI · Neutral · arXiv – CS AI · 4h ago · 2
🧠

Unlocking Cognitive Capabilities and Analyzing the Perception-Logic Trade-off

Researchers introduce MERaLiON2-Omni (Alpha), a 10B-parameter multilingual AI model designed for Southeast Asia that combines perception and reasoning capabilities. The study reveals an efficiency-stability paradox: reasoning improves abstract tasks but destabilizes basic sensory processing such as audio timing and visual interpretation.

AI · Bullish · arXiv – CS AI · 4h ago · 2
🧠

Uncertainty Quantification for Multimodal Large Language Models with Incoherence-adjusted Semantic Volume

Researchers introduce UMPIRE, a new training-free framework for quantifying uncertainty in Multimodal Large Language Models (MLLMs) across various input and output modalities. The system measures incoherence-adjusted semantic volume of model responses to better detect errors and improve reliability without requiring external tools or additional computational overhead.
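The summary's core idea — measuring how "spread out" a model's sampled responses are in semantic space — can be sketched with a simple volume proxy. This is a minimal illustration assuming responses are already embedded as vectors; `semantic_volume` is an illustrative name, not UMPIRE's actual API, and the incoherence adjustment is omitted.

```python
import numpy as np

def semantic_volume(embeddings: np.ndarray, eps: float = 1e-6) -> float:
    """Log-determinant of the centered Gram matrix of sampled response
    embeddings: a volume-style measure of how dispersed the responses
    are (larger = more disagreement = higher uncertainty)."""
    X = embeddings - embeddings.mean(axis=0, keepdims=True)
    gram = X @ X.T
    gram += eps * np.eye(gram.shape[0])  # keep log-det finite for degenerate samples
    _, logdet = np.linalg.slogdet(gram)
    return float(logdet)

rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.01, size=(5, 8))   # near-identical responses
spread = rng.normal(0.0, 1.0, size=(5, 8))   # widely varying responses
```

A confident model whose sampled answers agree yields a small volume; disagreement inflates it, flagging likely errors.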

AI · Bullish · arXiv – CS AI · 4h ago · 4
🧠

PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning

Researchers introduce PointCoT, a new AI framework that enables multimodal large language models to perform explicit geometric reasoning on 3D point cloud data using Chain-of-Thought methodology. The framework addresses current limitations where AI models suffer from geometric hallucinations by implementing a 'Look, Think, then Answer' paradigm with 86k instruction-tuning samples.
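The 'Look, Think, then Answer' paradigm can be pictured as a staged prompt scaffold. The template below is hypothetical — the paper's actual instruction-tuning format may differ — but it shows how the stages separate perception from reasoning before the final answer.

```python
# Hypothetical prompt scaffold for the 'Look, Think, then Answer' paradigm.
LOOK_THINK_ANSWER = (
    "[Look] Describe the salient geometric structures in the point cloud.\n"
    "[Think] Reason step by step about the question, citing only those structures.\n"
    "[Answer] Give the final answer in one line.\n"
    "Question: {question}"
)

prompt = LOOK_THINK_ANSWER.format(
    question="How many planar faces does the object have?"
)
```

Forcing the model to ground its reasoning in an explicit "Look" stage is what targets the geometric hallucinations the summary mentions.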

AI · Bullish · arXiv – CS AI · 4h ago · 2
🧠

A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Researchers have developed an 'Omnivorous Vision Encoder' that creates consistent feature representations across different visual modalities (RGB, depth, segmentation) of the same scene. The framework addresses the poor cross-modal alignment in existing vision encoders like DINOv2 by training with dual objectives to maximize feature alignment while preserving discriminative semantics.
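The "dual objectives" mentioned can be sketched as a two-term loss: one term pulls features of the same scene together across modalities, the other keeps features close to a frozen teacher (e.g. DINOv2) so discriminative semantics survive. The weighting and exact losses here are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def omnivorous_loss(f_rgb: np.ndarray, f_depth: np.ndarray,
                    f_teacher: np.ndarray, lam: float = 0.5) -> float:
    """Illustrative dual objective: cross-modal alignment (RGB vs. depth
    features of the same scene) plus distillation toward a frozen teacher
    to preserve semantics."""
    align = float(np.mean((f_rgb - f_depth) ** 2))      # maximize cross-modal alignment
    preserve = float(np.mean((f_rgb - f_teacher) ** 2))  # stay close to the teacher
    return align + lam * preserve
```

Perfect alignment with both the sibling modality and the teacher drives the loss to zero; drifting on either axis is penalized.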

AI · Bullish · arXiv – CS AI · 4h ago · 14
🧠

Reallocating Attention Across Layers to Reduce Multimodal Hallucination

Researchers propose a training-free solution to reduce hallucinations in multimodal AI models by rebalancing attention between perception and reasoning layers. The method achieves a 4.2% improvement in reasoning accuracy with minimal computational overhead.
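A minimal sketch of what "rebalancing attention across layers" can look like: boost the attention mass that selected layers place on visual tokens, then renormalize. The per-layer gain and masking scheme are assumptions for illustration; the paper's actual reallocation rule may differ.

```python
import numpy as np

def rebalance_attention(attn: np.ndarray, visual_mask: np.ndarray,
                        layer_gain: np.ndarray) -> np.ndarray:
    """Rescale each layer's attention on visual tokens by a per-layer gain,
    then renormalize so every row remains a valid distribution.
    attn:        (layers, queries, keys) post-softmax attention maps
    visual_mask: (keys,) bool, True where the key is a visual token
    layer_gain:  (layers,) multiplicative boost per layer"""
    boost = np.where(visual_mask[None, None, :], layer_gain[:, None, None], 1.0)
    scaled = attn * boost
    return scaled / scaled.sum(axis=-1, keepdims=True)
```

Being a pure rescaling of existing attention maps, this kind of intervention needs no retraining — consistent with the "training-free, minimal overhead" claim.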

AI · Bullish · arXiv – CS AI · 4h ago · 10
🧠

Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

Researchers developed Speculative Verdict (SV), a training-free framework that improves large Vision-Language Models' ability to reason over information-dense images by combining multiple small draft models with a larger verdict model. The approach achieves better accuracy on visual question answering benchmarks while reducing computational costs compared to large proprietary models.
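The draft-then-verdict pattern reduces to a simple control flow: cheap draft models each propose an answer, and a single larger model synthesizes them into a final verdict. The callables below are toy stand-ins for real VLM calls, used only to show the structure.

```python
from collections import Counter

def speculative_verdict(question, draft_models, verdict_model):
    """Collect proposals from small draft models, then let one larger
    verdict model produce the final answer from them. Interfaces are
    hypothetical stand-ins, not the paper's API."""
    proposals = [draft(question) for draft in draft_models]
    return verdict_model(question, proposals)

# Toy stand-ins: three drafts vote; the verdict keeps the majority answer.
drafts = [lambda q: "4", lambda q: "4", lambda q: "5"]
majority = lambda q, props: Counter(props).most_common(1)[0][0]
```

The cost saving comes from running the expensive model once over short proposals instead of over the full information-dense image for every query.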

AI · Bullish · arXiv – CS AI · 4h ago · 12
🧠

DeepEyesV2: Toward Agentic Multimodal Model

DeepEyesV2 is a new agentic multimodal AI model that combines text and image comprehension with external tool integration such as code execution and web search. The research introduces a two-stage training pipeline and the RealX-Bench evaluation framework, demonstrating improved real-world reasoning through adaptive tool invocation.
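"Adaptive tool invocation" follows a standard agentic loop: at each step the model either answers or names a tool, and the tool's output is fed back into the context. The loop and interfaces below are a generic sketch, not DeepEyesV2's actual API.

```python
def run_agent(model, tools, query, max_steps=5):
    """Minimal agentic loop: the model emits ('answer', text) to stop, or
    ('tool', name, arg) to invoke a tool (e.g. code execution, web search)
    whose result is appended to the context before the next call."""
    context = [("user", query)]
    for _ in range(max_steps):
        action = model(context)
        if action[0] == "answer":
            return action[1]
        _, name, arg = action
        context.append(("tool", tools[name](arg)))
    return None  # gave up within the step budget

# Toy model: calls the calculator once, then answers with its result.
def toy_model(context):
    if context[-1][0] == "tool":
        return ("answer", context[-1][1])
    return ("tool", "calc", "6*7")

tools = {"calc": lambda expr: str(eval(expr))}  # stand-in for code execution
```

The "adaptive" part is that the model decides per query whether a tool call is worth the extra step at all.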

AI · Neutral · arXiv – CS AI · 4h ago · 4
🧠

City Editing: Hierarchical Agentic Execution for Dependency-Aware Urban Geospatial Modification

Researchers have developed a hierarchical AI agent system that can automatically modify urban planning layouts using natural language instructions and GeoJSON data. The system decomposes editing tasks into geometric operations across multiple spatial levels and includes validation mechanisms to ensure spatial consistency during multi-step urban modifications.
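One kind of low-level geometric operation such a system decomposes edits into is a simple GeoJSON transform — for example, translating a building footprint. This sketch assumes standard GeoJSON Polygon structure; the validation step the summary mentions (overlap and consistency checks) would follow in a real pipeline.

```python
def translate_feature(feature: dict, dx: float, dy: float) -> dict:
    """Translate a GeoJSON Polygon feature by (dx, dy) - the kind of
    primitive geometric operation a natural-language edit is broken
    down into before spatial-consistency validation."""
    geom = feature["geometry"]
    assert geom["type"] == "Polygon"
    geom["coordinates"] = [
        [[x + dx, y + dy] for x, y in ring] for ring in geom["coordinates"]
    ]
    return feature

block = {
    "type": "Feature",
    "properties": {"name": "parcel-1"},
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[0, 0], [1, 0], [1, 1], [0, 1], [0, 0]]],
    },
}
```

Hierarchical execution means such primitives are applied at one spatial level (a parcel) while higher levels (the block, the district) are re-validated for consistency.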
