34 articles tagged with #multimodal-llm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv · CS AI · 6d ago · 7/10
🧠 Q-Zoom is a new framework that improves the efficiency of multimodal large language models by intelligently processing high-resolution visual inputs. Using adaptive query-aware perception, the system achieves 2.5-4.4x faster inference speeds on document and high-resolution tasks while maintaining or exceeding baseline accuracy across multiple MLLM architectures.
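The query-aware perception idea (spend full-resolution compute only on the image regions the query needs) can be caricatured as tile selection. Everything below, including the tile format, the `score_fn` relevance scorer, and `keep_k`, is a hypothetical stand-in for illustration, not Q-Zoom's actual mechanism:

```python
# Illustrative sketch only: route high-resolution compute to the image tiles
# most relevant to the query, and downsample the rest. The scorer here is a
# stub; a real system would use learned query-vision relevance.

def select_tiles(tiles, query, score_fn, keep_k):
    """Split tiles into a full-resolution set (top keep_k by relevance)
    and a coarse set that would be processed at reduced resolution."""
    ranked = sorted(tiles, key=lambda t: score_fn(t, query), reverse=True)
    return ranked[:keep_k], ranked[keep_k:]

# Toy tiles carry a precomputed relevance score in place of real features.
tiles = [("t0", 0.1), ("t1", 0.9), ("t2", 0.4), ("t3", 0.7)]
full_res, coarse = select_tiles(tiles, "what is the invoice total?",
                                lambda t, q: t[1], keep_k=2)
```

Fewer tiles at full resolution means fewer visual tokens to decode over, which is where a 2.5-4.4x speedup on document tasks would plausibly come from.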
AI · Bullish · arXiv · CS AI · 6d ago · 7/10
🧠 Researchers propose Faithful-First RPA, a framework that improves multimodal AI reasoning by prioritizing faithfulness to visual evidence. The method uses FaithEvi for supervision and FaithAct for execution, achieving up to 24% improvement in perceptual faithfulness without sacrificing task accuracy.
AI · Bullish · arXiv · CS AI · Apr 7 · 7/10
🧠 Researchers propose Continuous Softened Retracing reSampling (CSRS) to improve the self-evolution of Multimodal Large Language Models by addressing biases in feedback mechanisms. The method uses continuous reward signals instead of binary rewards and achieves state-of-the-art results on mathematical reasoning benchmarks like MathVision using Qwen2.5-VL-7B.
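The core contrast, continuous rather than binary reward, can be shown with a toy reward function. The token-overlap similarity below is a made-up stand-in for whatever softened signal CSRS actually uses:

```python
# Toy contrast between binary and continuous ("softened") rewards.
# The overlap similarity is an illustrative stand-in, not the paper's signal.

def binary_reward(pred: str, gold: str) -> float:
    """All-or-nothing: near-misses get the same feedback as garbage."""
    return 1.0 if pred == gold else 0.0

def continuous_reward(pred: str, gold: str) -> float:
    """Partial credit via token overlap (Jaccard), so a near-miss
    still produces a graded signal during self-evolution."""
    p, g = set(pred.split()), set(gold.split())
    return len(p & g) / len(p | g) if p | g else 1.0
```

A prediction like "x = 4" against gold "x = 5" scores 0.0 under the binary reward but 0.5 under the continuous one; graded feedback of this kind is what the method argues reduces bias in the self-improvement loop.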
AI · Neutral · arXiv · CS AI · Mar 17 · 7/10
🧠 Researchers identified that medical multimodal large language models (MLLMs) fail primarily due to inadequate visual grounding capabilities when analyzing medical images, unlike their success with natural scenes. They developed the VGMED evaluation dataset and proposed the VGRefine method, achieving state-of-the-art performance across 6 medical visual question-answering benchmarks without additional training.
AI · Bullish · arXiv · CS AI · Mar 17 · 7/10
🧠 Researchers developed SToRM, a new framework that reduces computational costs for autonomous driving systems using multi-modal large language models by up to 30x while maintaining performance. The system uses supervised token reduction techniques to enable real-time end-to-end driving on standard GPUs without sacrificing safety or accuracy.
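Token reduction of this kind is typically some form of importance-ranked pruning. The sketch below keeps a fixed fraction of visual tokens by score; the scores and keep ratio are hypothetical values, and SToRM's supervised scoring model is not shown:

```python
# Illustrative top-k pruning of visual tokens. In a supervised scheme the
# per-token scores would come from a trained predictor; here they are given.

def reduce_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring fraction of tokens, preserving order."""
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:k])  # restore original sequence order
    return [tokens[i] for i in kept]

tokens = ["v0", "v1", "v2", "v3", "v4", "v5"]
scores = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]
reduced = reduce_tokens(tokens, scores, keep_ratio=0.5)
```

A 30x reduction corresponds to a keep ratio around 0.03, i.e. only a few dozen visual tokens surviving out of a thousand or more, which is what makes real-time decoding on standard GPUs plausible.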
AI · Bullish · arXiv · CS AI · Mar 16 · 7/10
🧠 Researchers developed HeteroServe, a system that optimizes multimodal large language model inference by partitioning vision encoding and language generation across different GPU tiers. The approach reduces data transfer requirements and achieves 31-40% cost savings while improving throughput by up to 54% compared to existing systems.
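The data-transfer argument is easy to see with back-of-the-envelope numbers: once the vision encoder has run on the cheaper tier, only a compact embedding has to cross to the language-generation tier. All figures below (token count, hidden size, image resolution, fp16 precision) are assumed values for illustration, not numbers from the paper:

```python
# Back-of-the-envelope comparison: shipping visual embeddings across GPU
# tiers vs. shipping the raw image. All sizes are assumed, illustrative values.

def embedding_bytes(num_tokens, hidden_dim, dtype_bytes=2):
    """Bytes for num_tokens fp16 embedding vectors of width hidden_dim."""
    return num_tokens * hidden_dim * dtype_bytes

raw_rgb_bytes = 3 * 1344 * 1344          # uncompressed 1344x1344 RGB input
xfer_bytes = embedding_bytes(576, 4096)  # e.g. 576 visual tokens, fp16
```

Even with these generous assumptions the embedding already undercuts the raw pixels, and pruning or compressing visual tokens before the hop widens the gap further.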
AI · Bullish · arXiv · CS AI · Mar 11 · 7/10
🧠 Researchers have developed Meissa, a lightweight 4B-parameter medical AI model that brings advanced agentic capabilities offline for healthcare applications. The system matches frontier models like GPT on medical benchmarks while operating with 25x fewer parameters and 22x lower latency, addressing privacy and cost concerns in clinical settings.
🧠 Gemini
AI · Neutral · arXiv · CS AI · Mar 5 · 7/10
🧠 Researchers introduce SpatialBench, a comprehensive benchmark for evaluating spatial cognition in multimodal large language models (MLLMs). The framework reveals that while MLLMs excel at perceptual grounding, they struggle with symbolic reasoning, causal inference, and planning compared to humans, who demonstrate more goal-directed spatial abstraction.
AI · Bullish · arXiv · CS AI · Mar 4 · 7/10
🧠 Researchers introduce OptMerge, a new benchmark and method for combining multiple expert Multimodal Large Language Models (MLLMs) into single, more capable models without requiring additional training data. The approach achieves 2.48% average performance gains while reducing storage and serving costs by merging models across different modalities like vision, audio, and video.
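Training-free merging of expert checkpoints is generally built on parameter-space averaging. The sketch below is plain weighted averaging of state dicts, a common baseline for this family of methods rather than OptMerge itself; the expert names and toy parameters are invented:

```python
# Weighted parameter averaging: the simplest training-free merge of expert
# checkpoints. Real methods tune the weights or merge layer by layer.

def merge_state_dicts(dicts, weights):
    """Average matching parameters across expert models."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    return {key: sum(w * d[key] for w, d in zip(weights, dicts))
            for key in dicts[0]}

vision_expert = {"w": 1.0, "b": 0.0}   # toy one-layer "checkpoints"
audio_expert = {"w": 3.0, "b": 2.0}
merged = merge_state_dicts([vision_expert, audio_expert], [0.5, 0.5])
```

Because the merge happens in weight space, only one model needs to be stored and served, which is where the storage and serving savings come from.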
AI · Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers developed SpiroLLM, the first multimodal large language model capable of understanding spirogram time series data for COPD diagnosis. Using data from 234,028 UK Biobank individuals, the model achieved 0.8977 diagnostic AUROC and maintained a 100% valid response rate even with missing data, far outperforming text-only models.
AI · Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers propose Vid-LLM, a new video-based 3D multimodal large language model that processes video inputs without requiring external 3D data for scene understanding. The model uses a Cross-Task Adapter module and Metric Depth Model to integrate geometric cues and maintain consistency across 3D tasks like question answering and visual grounding.
AI · Neutral · arXiv · CS AI · Feb 27 · 7/10
🧠 Researchers identified a fundamental limitation in multimodal LLMs where decoders trained on text cannot effectively utilize non-text information like speaker identity or visual textures, despite this information being preserved through all model layers. The study demonstrates this 'modality collapse' is due to decoder design rather than encoding failures, with experiments showing targeted training can improve specific modality accessibility.
AI · Neutral · arXiv · CS AI · Feb 27 · 7/10
🧠 Researchers introduce ProactiveMobile, a new benchmark for developing AI agents that can proactively anticipate user needs on mobile devices rather than just responding to commands. The benchmark includes over 3,600 test instances across 14 scenarios, with current models achieving low success rates, indicating significant room for improvement in proactive AI capabilities.
AI · Neutral · arXiv · CS AI · 3d ago · 6/10
🧠 Researchers investigate how multimodal large language models (MLLMs) can assist with usability evaluation of user interfaces by analyzing text and visual context together. The study compares MLLM-generated assessments against expert evaluations, finding that these models can effectively prioritize usability issues by severity and offer complementary insights to traditional resource-intensive evaluation methods.
AI · Neutral · arXiv · CS AI · 6d ago · 6/10
🧠 Q-Probe introduces a novel agentic framework for scaling image quality assessment to high-resolution images by addressing limitations in existing reinforcement learning approaches. The research presents Vista-Bench, a new benchmark for fine-grained degradation analysis, and demonstrates state-of-the-art performance across multiple resolution scales through context-aware probing mechanisms.
AI · Neutral · arXiv · CS AI · Apr 7 · 6/10
🧠 Researchers identify critical limitations in current Multimodal Large Language Models' ability to understand physics and physical world dynamics. They propose Scene Dynamic Field (SDF), a new approach using physics simulators that achieves up to 20.7% performance improvements on fluid dynamics tasks.
AI · Bullish · arXiv · CS AI · Apr 6 · 6/10
🧠 Researchers have developed ForgeryGPT, a new multimodal AI framework that can detect, localize, and explain image forgeries through natural language interaction. The system combines advanced computer vision techniques with large language models to provide interpretable analysis of tampered images, addressing limitations in current forgery detection methods.
🧠 GPT-4
AI · Neutral · arXiv · CS AI · Mar 27 · 6/10
🧠 Researchers introduce ReLope, a new routing method for multimodal large language models that uses KL-regularized LoRA probes and attention mechanisms to improve cost-performance balance. The method addresses the challenge of degraded probe performance when visual inputs are added to text-only LLMs.
AI · Neutral · arXiv · CS AI · Mar 27 · 6/10
🧠 Researchers benchmarked 20 multimodal AI models on neuroimaging tasks using MRI and CT scans, finding that while technical attributes like imaging modality are nearly solved, diagnostic reasoning remains challenging. Gemini-2.5-Pro and GPT-5-Chat showed the strongest diagnostic performance, while the open-source MedGemma-1.5-4B demonstrated promising results under few-shot prompting.
🟢 Meta · 🧠 GPT-5 · 🧠 Gemini
AI · Bullish · arXiv · CS AI · Mar 27 · 6/10
🧠 Photon is a new framework that efficiently processes 3D medical imaging for AI visual question answering by using variable-length token sequences and adaptive compression. The system reduces computational costs while maintaining accuracy through instruction-conditioned token scheduling and custom gradient propagation techniques.
AI · Neutral · arXiv · CS AI · Mar 27 · 6/10
🧠 A benchmarking study reveals demographic bias in multimodal large language models used for face verification, testing nine models across different ethnicity and gender groups. The research found that face-specialized models outperform general-purpose MLLMs, but accuracy doesn't correlate with fairness, and bias patterns differ from traditional face recognition systems.
🟢 Meta
AI · Bullish · arXiv · CS AI · Mar 27 · 6/10
🧠 Researchers introduce TimeLens, a family of multimodal large language models optimized for video temporal grounding that outperforms existing open-source models and even surpasses proprietary models like GPT-5 and Gemini-2.5-Flash. The work addresses critical data quality issues in existing benchmarks and introduces improved training datasets and algorithmic design principles.
🧠 GPT-5 · 🧠 Gemini
AI · Neutral · arXiv · CS AI · Mar 26 · 6/10
🧠 Researchers introduce GameplayQA, a new benchmarking framework for evaluating multimodal large language models on 3D virtual agent perception and reasoning tasks. The framework uses densely annotated multiplayer gameplay videos with 2.4K diagnostic QA pairs, revealing substantial performance gaps between current frontier models and human-level understanding.
AI · Bullish · arXiv · CS AI · Mar 9 · 6/10
🧠 Researchers introduce Place-it-R1, an AI framework that uses Multimodal Large Language Models to insert objects into videos while maintaining physical realism. The system employs Chain-of-Thought reasoning to ensure inserted objects interact naturally with their environment, addressing the gap between visual quality and physical plausibility in video editing.
AI · Bullish · arXiv · CS AI · Mar 5 · 5/10
🧠 FeedAIde is a new AI-powered mobile app feedback system that uses Multimodal Large Language Models to guide users through submitting detailed bug reports and feature requests. The iOS framework captures contextual information like screenshots and asks follow-up questions to improve feedback quality, with testing showing enhanced completeness compared to traditional feedback forms.