#mllm News & Analysis

46 articles tagged with #mllm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · 3d ago · 7/10

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

Researchers introduce EgoBench, a new benchmark for evaluating AI agents' ability to perceive visual information, reason through multi-step tasks, and interact with users in real-world scenarios. Testing eight state-of-the-art video models reveals significant limitations, with the best performer achieving only 30.62% accuracy, exposing critical gaps in current AI agent capabilities.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Bandwidth-Efficient and Privacy-Preserving Edge-Cloud Many-to-Many Speech Translation

Researchers introduce ESRT, a privacy-preserving edge-cloud framework for multilingual speech-to-text translation that processes voice data locally while transmitting only compressed features to the cloud. The system achieves state-of-the-art performance across 45 languages while reducing bandwidth requirements by 10x and preventing voiceprint leakage.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

Researchers introduce VITAL, a latent-space reasoning framework for medical AI models that uses dual visual-semantic supervision to improve medical visual question answering while maintaining interpretability. The method addresses modality collapse and inference efficiency issues in existing approaches, achieving state-of-the-art results on 7 benchmarks using a newly constructed 61K medical imaging dataset.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

GuardAD: Safeguarding Autonomous Driving MLLMs via Markovian Safety Logic

Researchers introduce GuardAD, a safety framework for autonomous driving systems built on multimodal large language models (MLLMs) that uses Markovian safety logic to detect and prevent accidents. The model-agnostic safeguard reduces accident rates by 32% while improving task performance, combining neuro-symbolic logic with dynamic action revision rather than simple action veto mechanisms.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators

Researchers introduce V-Reflection, a new framework that transforms Multimodal Large Language Models (MLLMs) from passive observers to active interrogators through a 'think-then-look' mechanism. The approach addresses perception-related hallucinations in fine-grained tasks by allowing models to dynamically re-examine visual details during reasoning, showing significant improvements across six perception-intensive benchmarks.

AI · Neutral · arXiv – CS AI · Apr 6 · 7/10

Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

Researchers propose the Hallucination-as-Cue Framework to analyze reinforcement learning's effectiveness in training multimodal AI models. The study reveals that RL training can improve reasoning performance even under hallucination-inductive conditions, challenging assumptions about how these models learn from visual information.

AI · Bearish · arXiv – CS AI · Mar 26 · 7/10

When Understanding Becomes a Risk: Authenticity and Safety Risks in the Emerging Image Generation Paradigm

Research reveals that multimodal large language models (MLLMs) pose greater safety risks than diffusion models for image generation, producing more unsafe content and creating images that are harder for detection systems to identify. The enhanced semantic understanding capabilities of MLLMs, while more powerful, enable them to interpret complex prompts that lead to dangerous outputs including fake image synthesis.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

AD-Copilot: A Vision-Language Assistant for Industrial Anomaly Detection via Visual In-context Comparison

Researchers developed AD-Copilot, a specialized multimodal AI assistant for industrial anomaly detection that outperforms existing models and even human experts. The system uses a novel visual comparison approach and achieved 82.3% accuracy on benchmarks, up to a 3.35x improvement over baselines.

🏢 Microsoft

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Researchers introduce PRIMO R1, a 7B-parameter AI framework that transforms video MLLMs from passive observers into active critics for robotic manipulation tasks. The system uses reinforcement learning to achieve 50% better accuracy than specialized baselines and outperforms 72B-scale models, establishing state-of-the-art performance on the RoboFail benchmark.

🏢 OpenAI · 🧠 o1

AI · Neutral · arXiv – CS AI · Mar 11 · 7/10

OOD-MMSafe: Advancing MLLM Safety from Harmful Intent to Hidden Consequences

Researchers introduce OOD-MMSafe, a new benchmark revealing that current Multimodal Large Language Models fail to identify hidden safety risks up to 67.5% of the time. They also developed the CASPO framework, which reduces failure rates to under 8% for risk identification in consequence-driven safety scenarios.

AI · Neutral · arXiv – CS AI · Mar 5 · 6/10

Benchmarking MLLM-based Web Understanding: Reasoning, Robustness and Safety

Researchers introduced WebRRSBench, a comprehensive benchmark evaluating multimodal large language models' reasoning, robustness, and safety capabilities for web understanding tasks. Testing 11 MLLMs on 3,799 QA pairs from 729 websites revealed significant gaps in compositional reasoning, UI robustness, and safety-critical action recognition.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

EvoPrune: Early-Stage Visual Token Pruning for Efficient MLLMs

Researchers developed EvoPrune, a new method that prunes visual tokens during the encoding stage of Multimodal Large Language Models (MLLMs) rather than after encoding. The technique achieves 2x inference speedup with less than 1% performance loss on video datasets, addressing efficiency bottlenecks in AI models processing high-resolution images and videos.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

GeoSeg: Training-Free Reasoning-Driven Segmentation in Remote Sensing Imagery

Researchers introduce GeoSeg, a zero-shot, training-free framework for AI-driven segmentation of remote sensing imagery that uses multimodal language models for reasoning without requiring specialized training data. The system addresses domain-specific challenges in satellite and aerial image analysis through bias-aware coordinate refinement and dual-route prompting mechanisms.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Multimodal Large Language Models for Low-Resource Languages: A Case Study for Basque

Researchers successfully developed multimodal large language models for Basque, a low-resource language, finding that only 20% Basque training data is needed for solid performance. The study demonstrates that specialized Basque language backbones aren't required, potentially enabling MLLM development for other underrepresented languages.

🧠 Llama

AI · Bullish · arXiv – CS AI · Mar 4 · 6/10

OCR or Not? Rethinking Document Information Extraction in the MLLMs Era with Real-World Large-Scale Datasets

A large-scale benchmarking study finds that powerful Multimodal Large Language Models (MLLMs) can extract information from business documents using image-only input, potentially eliminating the need for traditional OCR preprocessing. The research demonstrates that well-designed prompts and instructions can further enhance MLLM performance in document processing tasks.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10

Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping

Researchers introduce Interaction2Code, the first benchmark for evaluating Multimodal Large Language Models' ability to generate interactive webpage code from prototypes. The study identifies four critical limitations in current MLLMs and proposes enhancement strategies to improve their performance on dynamic web interactions.

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10

MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Researchers introduced MMR-Life, a comprehensive benchmark with 2,646 questions and 19,108 real-world images to evaluate multimodal reasoning capabilities of AI models. Even top models like GPT-5 achieved only 58% accuracy, highlighting significant challenges in real-world multimodal reasoning across seven different reasoning types.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Adaptive Multimodal Agents-Based Framework for Automatic Workflow Execution

Researchers propose a novel multimodal multi-agent framework that uses graph-based knowledge construction and adaptive retrieval-augmented generation to enable autonomous agents to execute complex workflows more effectively. The system combines offline discovery of workflow topology from execution logs with real-time collaborative verification, demonstrating improved performance in novel scenarios with limited training data.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

Researchers introduce VCap, a reinforcement learning reward mechanism that improves visual captioning in multimodal AI models by grounding caption verification in actual visual signals. An 8B-parameter model trained with VCap outperforms larger open- and closed-source competitors on image and video captioning benchmarks, demonstrating that smarter reward design can enable weak-to-strong generalization in AI training.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Researchers introduce BalCapRL, a reinforcement learning framework that improves multimodal image captioning by balancing three competing objectives: utility-aware correctness, reference coverage, and linguistic quality. The method achieves significant performance gains across multiple models by applying reward-decoupled normalization and length-conditional masking, addressing the trade-offs present in existing captioning approaches.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

ICU-Bench: Benchmarking Continual Unlearning in Multimodal Large Language Models

Researchers introduce ICU-Bench, a new benchmark for testing machine unlearning in multimodal AI models, addressing privacy concerns from large-scale training datasets. The benchmark reveals that current unlearning methods struggle with continuous privacy deletion requests, highlighting a critical gap between theoretical approaches and real-world deployment needs.

AI · Neutral · arXiv – CS AI · May 9 · 6/10

CrossCult-KIBench: A Benchmark for Cross-Cultural Knowledge Insertion in MLLMs

Researchers introduce CrossCult-KIBench, a benchmark dataset for evaluating how multimodal large language models (MLLMs) handle cross-cultural knowledge insertion across English, Chinese, and Arabic contexts. The work reveals that current AI models struggle to adapt to specific cultural contexts without degrading performance in other cultures, establishing a new research direction for culturally-aware AI systems.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization

Researchers propose Trajectory Induced Preference Optimization (TIPO), a novel method for training mobile GUI agents to respect user privacy preferences while maintaining task execution capability. The approach addresses the challenge that privacy-conscious users generate structurally different execution patterns than utility-focused users, requiring specialized optimization techniques to properly align agent behavior with individual privacy preferences.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

Researchers introduce BoxTuning, a novel approach for improving video understanding in multimodal AI models by rendering object bounding boxes directly onto video frames as visual prompts rather than encoding them as text tokens. The method achieves 87-93% reduction in text token usage while maintaining full temporal resolution, demonstrating superior performance on video question-answering tasks.

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

M³KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

Researchers introduce M³KG-RAG, a novel multimodal retrieval-augmented generation system that enhances large language models by integrating multi-hop knowledge graphs with audio-visual data. The approach improves reasoning depth and answer accuracy by filtering irrelevant information through a new grounding and pruning mechanism called GRASP.
