#vlm News & Analysis

41 articles tagged with #vlm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

41 articles

AIBullisharXiv – CS AI · Apr 77/10

🧠

Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving

Researchers developed Sim2Real-AD, a framework that successfully transfers VLM-guided reinforcement learning policies trained in CARLA simulation to real autonomous vehicles without requiring real-world training data. The system achieved 75-90% success rates in real-world driving scenarios when deployed on a full-scale Ford E-Transit.

AIBullisharXiv – CS AI · Apr 67/10

🧠

Training Multi-Image Vision Agents via End2End Reinforcement Learning

Researchers introduce IMAgent, an open-source visual AI agent trained with reinforcement learning to handle multi-image reasoning tasks. The system addresses limitations of current VLM-based agents that only process single images, using specialized tools for visual reflection and verification to maintain attention on image content throughout inference.

🏢 OpenAI🧠 o1🧠 o3

AIBullisharXiv – CS AI · Mar 167/10

🧠

DriveMind: A Dual Visual Language Model-based Reinforcement Learning Framework for Autonomous Driving

DriveMind introduces a new AI framework combining vision-language models with reinforcement learning for autonomous driving, achieving significant performance improvements in safety and route completion. The system demonstrates strong cross-domain generalization from simulation to real-world dash-cam data, suggesting practical deployment potential.

AIBullisharXiv – CS AI · Mar 127/10

🧠

Taking Shortcuts for Categorical VQA Using Super Neurons

Researchers introduce Super Neurons (SNs), a new method that probes raw activations in Vision Language Models to improve classification performance while achieving up to 5.10x speedup. Unlike Sparse Attention Vectors, SNs can identify discriminative neurons in shallow layers, enabling extreme early exiting from the first layer at the first generated token.

AIBullisharXiv – CS AI · Mar 56/10

🧠

Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory

Researchers introduce PhysMem, a memory framework that enables vision-language model robot planners to learn physical principles through real-time interaction without updating model parameters. The system records experiences, generates hypotheses, and verifies them before application, achieving 76% success on brick insertion tasks compared to 23% for direct experience retrieval.

AIBullisharXiv – CS AI · Mar 57/10

🧠

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Researchers developed DMAST, a new training framework that protects multimodal web agents from cross-modal attacks where adversaries inject malicious content into webpages to deceive both visual and text processing channels. The method uses adversarial training through a three-stage pipeline and significantly outperforms existing defenses while doubling task completion efficiency.

AIBullisharXiv – CS AI · Mar 57/10

🧠

TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

Researchers have developed TIGeR, a framework that enhances Vision-Language Models with precise geometric reasoning capabilities for robotics applications. The system enables VLMs to execute centimeter-level accurate computations by integrating external computational tools, moving beyond qualitative spatial reasoning to quantitative precision required for real-world robotic manipulation.

AINeutralarXiv – CS AI · Mar 56/10

🧠

Cognition Envelopes for Bounded Decision Making in Autonomous UAS Operations

Researchers introduce 'Cognition Envelopes' as a new framework to constrain AI decision-making in autonomous systems, addressing errors like hallucinations in Large Language Models and Vision-Language Models. The approach is demonstrated through autonomous drone search and rescue missions, establishing reasoning boundaries to complement traditional safety measures.

AINeutralarXiv – CS AI · Mar 46/103

🧠

ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

Researchers introduce ViPlan, the first benchmark for comparing Vision-Language Model planning approaches, finding that VLM-as-grounder methods excel in visual tasks like Blocksworld while VLM-as-planner methods perform better in household robotics scenarios. The study reveals fundamental limitations in current VLMs' visual reasoning abilities, with Chain-of-Thought prompting showing no consistent benefits.

AIBullisharXiv – CS AI · Feb 277/107

🧠

Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

Researchers introduce Spatial Credit Redistribution (SCR), a training-free method that reduces hallucination in vision-language models by 4.7-6.0 percentage points. The technique redistributes attention from dominant visual patches to contextual areas, addressing the spatial credit collapse problem that causes AI models to generate false objects.

AIBullisharXiv – CS AI · 3d ago6/10

🧠

Object-Centric Vision Token Pruning for Vision Language Models

Researchers introduce OC-VTP, a lightweight vision token pruning method for Vision Language Models that reduces computational overhead by selectively retaining the most representative visual tokens without requiring model fine-tuning. The approach maintains inference accuracy across all pruning ratios while providing computational efficiency gains and interpretability benefits.

AIBullisharXiv – CS AI · May 126/10

🧠

Do multimodal models imagine electric sheep?

Researchers demonstrate that large multimodal models develop internal visual representations when solving spatial reasoning tasks, improving puzzle-solving accuracy from 83% to 89% by integrating visual tokens into chain-of-thought reasoning. The findings suggest AI systems spontaneously form world models without explicit visual supervision, with practical applications for enhancing spatial reasoning capabilities.

AINeutralarXiv – CS AI · May 126/10

🧠

Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

Researchers introduce MarsTSC, a novel framework combining Vision Language Models with agentic reasoning for few-shot multimodal time series classification. The system uses collaborative AI roles—Generator, Reflector, and Modifier—to iteratively refine knowledge and improve classification accuracy across 12 benchmarks while providing interpretable explanations.

AIBullisharXiv – CS AI · May 126/10

🧠

Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT

Researchers have developed a knowledge distillation framework that compresses a 7B 3D vision-language model into a 2.29B student model, achieving 8.7x faster inference while retaining 54-72% performance. The approach introduces "Hidden CoT," learnable latent tokens that enable spatial reasoning without explicit chain-of-thought training data, making 3D scene understanding feasible on resource-constrained devices.

AINeutralarXiv – CS AI · May 126/10

🧠

Don't Click That: Teaching Web Agents to Resist Deceptive Interfaces

Researchers introduce DUDE, a framework that teaches AI web agents to resist deceptive interface elements through hybrid-reward learning and experience summarization. The accompanying RUC benchmark demonstrates the framework reduces susceptibility to deception by 53.8% while preserving task performance, addressing a critical vulnerability in autonomous GUI interaction systems.

AIBullisharXiv – CS AI · Apr 206/10

🧠

MM-Telco: Benchmarks and Multimodal Large Language Models for Telecom Applications

Researchers introduce MM-Telco, a comprehensive multimodal benchmark and model suite designed to adapt large language models for telecommunications applications. The framework addresses domain-specific challenges in network optimization, troubleshooting, and customer support, with fine-tuned models demonstrating significant performance improvements over baseline LLMs.

AINeutralarXiv – CS AI · Apr 76/10

🧠

Discovering Failure Modes in Vision-Language Models using RL

Researchers developed an AI framework using reinforcement learning to automatically discover failure modes in vision-language models without human intervention. The system trains a questioner agent that generates adaptive queries to expose weaknesses, successfully identifying 36 novel failure modes across various VLM combinations.

AIBearisharXiv – CS AI · Apr 66/10

🧠

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

Researchers introduce VLM-UnBench, the first benchmark for evaluating training-free visual concept unlearning in Vision Language Models. The study reveals that realistic prompts fail to genuinely remove sensitive or copyrighted visual concepts, with meaningful suppression only occurring under oracle conditions that explicitly disclose target concepts.

AIBullisharXiv – CS AI · Mar 266/10

🧠

ELITE: Experiential Learning and Intent-Aware Transfer for Self-improving Embodied Agents

Researchers introduce ELITE, a new framework that enables AI embodied agents to learn from their own experiences and transfer knowledge to similar tasks. The system addresses failures in vision-language models when performing complex physical tasks by using self-reflective knowledge construction and intent-aware retrieval mechanisms.

AINeutralarXiv – CS AI · Mar 266/10

🧠

Can VLMs Reason Robustly? A Neuro-Symbolic Investigation

Researchers investigated whether Vision-Language Models (VLMs) can reason robustly under distribution shifts and found that fine-tuned VLMs achieve high accuracy in-distribution but fail to generalize. They propose VLC, a neuro-symbolic method combining VLM-based concept recognition with circuit-based symbolic reasoning that demonstrates consistent performance under covariate shifts.

AIBullisharXiv – CS AI · Mar 176/10

🧠

UVLM: A Universal Vision-Language Model Loader for Reproducible Multimodal Benchmarking

Researchers have introduced UVLM (Universal Vision-Language Model Loader), a Google Colab-based framework that provides a unified interface for loading, configuring, and benchmarking multiple Vision-Language Model architectures. The framework currently supports LLaVA-NeXT and Qwen2.5-VL models and enables researchers to compare different VLMs using identical evaluation protocols on custom image analysis tasks.

AIBullisharXiv – CS AI · Mar 176/10

🧠

MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings

Researchers propose MA-VLCM, a framework that uses pretrained vision-language models as centralized critics in multi-agent reinforcement learning instead of learning critics from scratch. This approach significantly improves sample efficiency and enables zero-shot generalization while producing compact policies suitable for resource-constrained robots.

AIBullisharXiv – CS AI · Mar 96/10

🧠

VLMQ: Token Saliency-Driven Post-Training Quantization for Vision-language Models

Researchers introduced VLMQ, a post-training quantization framework specifically designed for vision-language models that addresses visual over-representation and modality gaps. The method achieves significant performance improvements, including 16.45% better results on MME-RealWorld under 2-bit quantization compared to existing approaches.

AIBullisharXiv – CS AI · Mar 66/10

🧠

Authorize-on-Demand: Dynamic Authorization with Legality-Aware Intellectual Property Protection for VLMs

Researchers propose AoD-IP, a new framework for protecting intellectual property in vision-language models through dynamic authorization and legality-aware assessment. The system allows flexible, user-controlled authorization that can adapt to changing deployment scenarios while preventing unauthorized use of valuable AI models.

AIBullisharXiv – CS AI · Mar 37/108

🧠

Securing the Floor and Raising the Ceiling: A Merging-based Paradigm for Multi-modal Search Agents

Researchers propose a training-free paradigm for empowering Vision-Language Models with multi-modal search capabilities through cross-modal model merging. The approach uses Optimal Brain Merging (OBM) to combine text-based search agents with base VLMs without requiring expensive supervised training or reinforcement learning.

Page 1 of 2Next →