y0news

#multimodal-llm News & Analysis

34 articles tagged with #multimodal-llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Q-Zoom is a new framework that improves the efficiency of multimodal large language models by intelligently processing high-resolution visual inputs. Using adaptive query-aware perception, the system achieves 2.5-4.4x faster inference speeds on document and high-resolution tasks while maintaining or exceeding baseline accuracy across multiple MLLM architectures.
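
For intuition, here's a minimal sketch of what query-aware tile selection can look like: score high-resolution tiles against the text query and run the expensive vision encoder only on the top-scoring ones. Names, shapes, and the scoring rule below are hypothetical, not Q-Zoom's actual method.

```python
# Hypothetical sketch of query-aware tile selection; illustrative only.
import numpy as np

def select_tiles(query_emb: np.ndarray, tile_embs: np.ndarray, keep_ratio: float = 0.25):
    """Return indices of the tiles most relevant to the query.

    query_emb: (d,) cheap embedding of the text query
    tile_embs: (n_tiles, d) cheap per-tile embeddings (e.g. pooled patches)
    """
    # Cosine similarity between the query and each tile.
    q = query_emb / np.linalg.norm(query_emb)
    t = tile_embs / np.linalg.norm(tile_embs, axis=1, keepdims=True)
    scores = t @ q
    k = max(1, int(keep_ratio * len(tile_embs)))
    return np.argsort(scores)[::-1][:k]  # the k best tiles, best first

rng = np.random.default_rng(0)
keep = select_tiles(rng.normal(size=64), rng.normal(size=(16, 64)))
print(keep)  # only these tiles would go through the full vision encoder
```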

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Faithful-First Reasoning, Planning, and Acting for Multimodal LLMs

Researchers propose Faithful-First RPA, a framework that improves multimodal AI reasoning by prioritizing faithfulness to visual evidence. The method uses FaithEvi for supervision and FaithAct for execution, achieving up to 24% improvement in perceptual faithfulness without sacrificing task accuracy.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

Stabilizing Unsupervised Self-Evolution of MLLMs via Continuous Softened Retracing reSampling

Researchers propose Continuous Softened Retracing reSampling (CSRS) to improve the self-evolution of Multimodal Large Language Models by addressing biases in feedback mechanisms. The method uses continuous reward signals instead of binary rewards and achieves state-of-the-art results on mathematical reasoning benchmarks like MathVision using Qwen2.5-VL-7B.
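
To see why continuous rewards can stabilize learning, here's a toy contrast between a binary correctness reward and a softened one that still gives near-misses some signal. This is a generic illustration, not the CSRS formulation.

```python
# Binary vs. softened reward on a numeric answer; illustrative only.
import math

def binary_reward(pred: float, target: float, tol: float = 1e-6) -> float:
    return 1.0 if abs(pred - target) < tol else 0.0

def soft_reward(pred: float, target: float, temperature: float = 1.0) -> float:
    # Decays smoothly with error, so near-misses still carry signal.
    return math.exp(-abs(pred - target) / temperature)

for pred in (42.0, 41.5, 30.0):
    print(pred, binary_reward(pred, 42.0), round(soft_reward(pred, 42.0), 3))
```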

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images

Researchers identified that medical multimodal large language models (MLLMs) fail primarily because of inadequate visual grounding when analyzing medical images, in contrast to their success on natural scenes. They developed the VGMED evaluation dataset and proposed the VGRefine method, achieving state-of-the-art performance across 6 medical visual question-answering benchmarks without additional training.

AI · Bullish · arXiv – CS AI · Mar 16 · 7/10

Cost-Efficient Multimodal LLM Inference via Cross-Tier GPU Heterogeneity

Researchers developed HeteroServe, a system that optimizes multimodal large language model inference by partitioning vision encoding and language generation across different GPU tiers. The approach reduces data transfer requirements and achieves 31-40% cost savings while improving throughput by up to 54% compared to existing systems.
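
A minimal sketch of the cross-tier idea: run the vision encoder on one device and ship only the compact token tensor to the device hosting the language model. The devices and modules below are placeholders, not HeteroServe's API.

```python
# Toy cross-tier pipeline: vision encoding on one device, LLM-side
# projection on another; only the small token tensor crosses over.
import torch
import torch.nn as nn

vision_dev = "cuda:0" if torch.cuda.device_count() > 0 else "cpu"
llm_dev = "cuda:1" if torch.cuda.device_count() > 1 else "cpu"

vision_encoder = nn.Sequential(nn.Conv2d(3, 16, 14, stride=14), nn.Flatten(2)).to(vision_dev)
proj = nn.Linear(16, 64).to(llm_dev)  # stand-in for the LLM tier's input projection

image = torch.rand(1, 3, 224, 224, device=vision_dev)
tokens = vision_encoder(image).transpose(1, 2)  # (1, 256, 16) compact vision tokens
tokens = tokens.to(llm_dev)                     # only this crosses the tier boundary
print(proj(tokens).shape)                       # (1, 256, 64), ready for the LLM
```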

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

Meissa: Multi-modal Medical Agentic Intelligence

Researchers have developed Meissa, a lightweight 4B-parameter medical AI model that brings advanced agentic capabilities offline for healthcare applications. The system matches frontier models like GPT on medical benchmarks while operating with 25x fewer parameters and 22x lower latency, addressing privacy and cost concerns in clinical settings.

🧠 Gemini
AI · Neutral · arXiv – CS AI · Mar 5 · 7/10

SpatialBench: Benchmarking Multimodal Large Language Models for Spatial Cognition

Researchers introduce SpatialBench, a comprehensive benchmark for evaluating spatial cognition in multimodal large language models (MLLMs). The framework reveals that while MLLMs excel at perceptual grounding, they struggle with symbolic reasoning, causal inference, and planning compared to humans who demonstrate more goal-directed spatial abstraction.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10

OptMerge: Unifying Multimodal LLM Capabilities and Modalities via Model Merging

Researchers introduce OptMerge, a new benchmark and method for combining multiple expert Multimodal Large Language Models (MLLMs) into single, more capable models without requiring additional training data. The approach achieves 2.48% average performance gains while reducing storage and serving costs by merging models across different modalities like vision, audio, and video.
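
Model merging in its simplest form is weight-space averaging of expert checkpoints; the sketch below shows that baseline idea (OptMerge's actual merging rule may well differ).

```python
# Generic weighted averaging of expert state dicts; illustrative only.
import torch

def merge_state_dicts(experts: list[dict], weights: list[float]) -> dict:
    """Weighted average of parameter tensors shared by all experts."""
    return {name: sum(w * sd[name] for w, sd in zip(weights, experts))
            for name in experts[0]}

vision_expert = {"linear.weight": torch.ones(2, 2)}
audio_expert = {"linear.weight": torch.zeros(2, 2)}
merged = merge_state_dicts([vision_expert, audio_expert], [0.7, 0.3])
print(merged["linear.weight"])  # tensor of 0.7s
```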

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

SpiroLLM: Finetuning Pretrained LLMs to Understand Spirogram Time Series with Clinical Validation in COPD Reporting

Researchers developed SpiroLLM, the first multimodal large language model capable of understanding spirogram time series data for COPD diagnosis. Using data from 234,028 UK Biobank participants, the model achieved a diagnostic AUROC of 0.8977 and maintained a 100% valid-response rate even with missing data, far outperforming text-only models.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy

Researchers propose Vid-LLM, a new video-based 3D multimodal large language model that processes video inputs without requiring external 3D data for scene understanding. The model uses a Cross-Task Adapter module and Metric Depth Model to integrate geometric cues and maintain consistency across 3D tasks like question answering and visual grounding.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10

Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs

Researchers identified a fundamental limitation in multimodal LLMs where decoders trained on text cannot effectively utilize non-text information like speaker identity or visual textures, despite this information being preserved through all model layers. The study demonstrates this 'modality collapse' is due to decoder design rather than encoding failures, with experiments showing targeted training can improve specific modality accessibility.
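
The standard tool for showing that information survives the encoder is a linear probe on frozen hidden states: if a simple classifier can read the attribute out but the decoder never surfaces it, the bottleneck is decoding. A toy version on synthetic features, standing in for the paper's experiments:

```python
# Linear probe for a hidden attribute (e.g. speaker ID); synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
speaker = rng.integers(0, 2, size=500)      # the non-text attribute
feats = rng.normal(size=(500, 32))          # frozen "hidden states"
feats[:, 0] += 2.0 * speaker                # the attribute is linearly encoded
probe = LogisticRegression().fit(feats[:400], speaker[:400])
print("probe accuracy:", probe.score(feats[400:], speaker[400:]))
# High probe accuracy + a decoder that ignores it = mismatched decoding.
```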

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10

ProactiveMobile: A Comprehensive Benchmark for Boosting Proactive Intelligence on Mobile Devices

Researchers introduce ProactiveMobile, a new benchmark for developing AI agents that can proactively anticipate user needs on mobile devices rather than just responding to commands. The benchmark includes over 3,600 test instances across 14 scenarios, with current models achieving low success rates, indicating significant room for improvement in proactive AI capabilities.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Investigating Multimodal Large Language Models to Support Usability Evaluation

Researchers investigate how multimodal large language models (MLLMs) can assist with usability evaluation of user interfaces by analyzing text and visual context together. The study compares MLLM-generated assessments against expert evaluations, finding that these models can effectively prioritize usability issues by severity and offer insights that complement traditional, resource-intensive evaluation methods.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing

Q-Probe introduces a novel agentic framework for scaling image quality assessment to high-resolution images by addressing limitations in existing reinforcement learning approaches. The research presents Vista-Bench, a new benchmark for fine-grained degradation analysis, and demonstrates state-of-the-art performance across multiple resolution scales through context-aware probing mechanisms.

AI · Bullish · arXiv – CS AI · Apr 6 · 6/10

ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization

Researchers have developed ForgeryGPT, a new multimodal AI framework that can detect, localize, and explain image forgeries through natural language interaction. The system combines advanced computer vision techniques with large language models to provide interpretable analysis of tampered images, addressing limitations in current forgery detection methods.

🧠 GPT-4
AI · Neutral · arXiv – CS AI · Mar 27 · 6/10

ReLope: KL-Regularized LoRA Probes for Multimodal LLM Routing

Researchers introduce ReLope, a new routing method for multimodal large language models that uses KL-regularized LoRA probes and attention mechanisms to improve cost-performance balance. The method addresses the challenge of degraded probe performance when visual inputs are added to text-only LLMs.
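
A toy version of what a KL-regularized routing probe's objective might look like: cross-entropy for picking the right model plus a KL term keeping the probe's distribution near a reference policy. Names and hyperparameters below are hypothetical, not ReLope's code.

```python
# Routing-probe loss: cross-entropy + beta * KL(probe || reference).
import torch
import torch.nn.functional as F

def probe_loss(logits, labels, ref_probs, beta: float = 0.1):
    ce = F.cross_entropy(logits, labels)            # route correctly
    kl = F.kl_div(F.log_softmax(logits, dim=-1),    # stay near the reference
                  ref_probs, reduction="batchmean")
    return ce + beta * kl

logits = torch.randn(8, 2, requires_grad=True)  # probe scores for 2 routes
labels = torch.randint(0, 2, (8,))              # which route was best
ref = torch.full((8, 2), 0.5)                   # uniform reference policy
print(probe_loss(logits, labels, ref))
```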

AI · Neutral · arXiv – CS AI · Mar 27 · 6/10

NeuroVLM-Bench: Evaluation of Vision-Enabled Large Language Models for Clinical Reasoning in Neurological Disorders

Researchers benchmarked 20 multimodal AI models on neuroimaging tasks using MRI and CT scans, finding that while recognizing technical attributes such as imaging modality is nearly solved, diagnostic reasoning remains challenging. Gemini-2.5-Pro and GPT-5-Chat showed the strongest diagnostic performance, while the open-source MedGemma-1.5-4B demonstrated promising results under few-shot prompting.

๐Ÿข Meta๐Ÿง  GPT-5๐Ÿง  Gemini
AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

Photon: Speedup Volume Understanding with Efficient Multimodal Large Language Models

Photon is a new framework that efficiently processes 3D medical imaging for AI visual question answering by using variable-length token sequences and adaptive compression. The system reduces computational costs while maintaining accuracy through instruction-conditioned token scheduling and custom gradient propagation techniques.
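
A generic sketch of adaptive token compression for volumetric inputs: pool the flattened 3D patch sequence down to a per-instruction token budget, so cheaper questions spend fewer tokens. This illustrates the budgeting idea, not Photon's scheduler.

```python
# Budgeted pooling of volume tokens; illustrative only.
import torch
import torch.nn.functional as F

def compress_volume_tokens(vol_tokens: torch.Tensor, budget: int) -> torch.Tensor:
    """vol_tokens: (n_tokens, d). Average-pool down to at most `budget` tokens."""
    n, d = vol_tokens.shape
    if n <= budget:
        return vol_tokens
    pooled = F.adaptive_avg_pool1d(vol_tokens.t().unsqueeze(0), budget)
    return pooled.squeeze(0).t()

tokens = torch.randn(4096, 256)      # e.g. flattened CT patch features
for budget in (4096, 1024, 256):     # simpler instructions get fewer tokens
    print(budget, compress_volume_tokens(tokens, budget).shape)
```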

AI · Neutral · arXiv – CS AI · Mar 27 · 6/10

Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification

A benchmarking study reveals demographic bias in multimodal large language models used for face verification, testing nine models across different ethnicity and gender groups. The research found that face-specialized models outperform general-purpose MLLMs, but accuracy doesn't correlate with fairness, and bias patterns differ from traditional face recognition systems.

๐Ÿข Meta
AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs

Researchers introduce TimeLens, a family of multimodal large language models optimized for video temporal grounding that outperforms existing open-source models and even surpasses proprietary models like GPT-5 and Gemini-2.5-Flash. The work addresses critical data quality issues in existing benchmarks and introduces improved training datasets and algorithmic design principles.

🧠 GPT-5 · 🧠 Gemini
AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Researchers introduce GameplayQA, a new benchmarking framework for evaluating multimodal large language models on 3D virtual agent perception and reasoning tasks. The framework uses densely annotated multiplayer gameplay videos with 2.4K diagnostic QA pairs, revealing substantial performance gaps between current frontier models and human-level understanding.

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

Place-it-R1: Unlocking Environment-aware Reasoning Potential of MLLM for Video Object Insertion

Researchers introduce Place-it-R1, an AI framework that uses Multimodal Large Language Models to insert objects into videos while maintaining physical realism. The system employs Chain-of-Thought reasoning to ensure inserted objects interact naturally with their environment, addressing the gap between visual quality and physical plausibility in video editing.

AI · Bullish · arXiv – CS AI · Mar 5 · 5/10

FeedAIde: Guiding App Users to Submit Rich Feedback Reports by Asking Context-Aware Follow-Up Questions

FeedAIde is a new AI-powered mobile app feedback system that uses Multimodal Large Language Models to guide users through submitting detailed bug reports and feature requests. The iOS framework captures contextual information like screenshots and asks follow-up questions to improve feedback quality, with testing showing enhanced completeness compared to traditional feedback forms.

Page 1 of 2 · Next →