
#vlm News & Analysis

35 articles tagged with #vlm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

Sim2Real-AD: A Modular Sim-to-Real Framework for Deploying VLM-Guided Reinforcement Learning in Real-World Autonomous Driving

Researchers developed Sim2Real-AD, a framework that successfully transfers VLM-guided reinforcement learning policies trained in CARLA simulation to real autonomous vehicles without requiring real-world training data. The system achieved 75-90% success rates in real-world driving scenarios when deployed on a full-scale Ford E-Transit.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

Training Multi-Image Vision Agents via End2End Reinforcement Learning

Researchers introduce IMAgent, an open-source visual AI agent trained with reinforcement learning to handle multi-image reasoning tasks. The system addresses limitations of current VLM-based agents that only process single images, using specialized tools for visual reflection and verification to maintain attention on image content throughout inference.

๐Ÿข OpenAI๐Ÿง  o1๐Ÿง  o3
AIBullisharXiv โ€“ CS AI ยท Mar 167/10
๐Ÿง 

DriveMind: A Dual Visual Language Model-based Reinforcement Learning Framework for Autonomous Driving

DriveMind introduces a new AI framework combining vision-language models with reinforcement learning for autonomous driving, achieving significant performance improvements in safety and route completion. The system demonstrates strong cross-domain generalization from simulation to real-world dash-cam data, suggesting practical deployment potential.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Taking Shortcuts for Categorical VQA Using Super Neurons

Researchers introduce Super Neurons (SNs), a new method that probes raw activations in Vision Language Models to improve classification performance while achieving up to 5.10x speedup. Unlike Sparse Attention Vectors, SNs can identify discriminative neurons in shallow layers, enabling extreme early exiting from the first layer at the first generated token.
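
The mechanics can be pictured as a tiny probe over shallow-layer activations. Below is a minimal, self-contained sketch of that idea; the neuron-selection rule, class prototypes, and sizes are invented for illustration and are not the paper's actual procedure.

```python
# A caricature of activation probing with early exit: pick per-class
# "super neurons" offline, then classify from a shallow layer's hidden
# state at the first generated token and stop decoding entirely.
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, HIDDEN_DIM, NUM_SN = 4, 512, 32

# Offline (assumed): estimate class prototypes from a few labeled examples
# and keep, per class, the neurons that deviate most from the global mean.
class_means = rng.normal(size=(NUM_CLASSES, HIDDEN_DIM))
overall_mean = class_means.mean(axis=0)
sn_idx = np.argsort(-np.abs(class_means - overall_mean), axis=1)[:, :NUM_SN]

def early_exit_predict(hidden: np.ndarray) -> tuple[int, float]:
    """Score each class by its super neurons' agreement with the class
    prototype, then exit without generating any further tokens."""
    scores = np.array([hidden[sn_idx[c]] @ class_means[c, sn_idx[c]]
                       for c in range(NUM_CLASSES)])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(probs.argmax()), float(probs.max())

# Simulated first-token hidden state from a shallow layer.
h = class_means[2] + 0.5 * rng.normal(size=HIDDEN_DIM)
label, conf = early_exit_predict(h)
print(f"predicted class {label} with confidence {conf:.2f}")
```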

AI · Neutral · arXiv – CS AI · Mar 5 · 6/10

Cognition Envelopes for Bounded Decision Making in Autonomous UAS Operations

Researchers introduce 'Cognition Envelopes' as a new framework to constrain AI decision-making in autonomous systems, addressing errors like hallucinations in Large Language Models and Vision-Language Models. The approach is demonstrated through autonomous drone search and rescue missions, establishing reasoning boundaries to complement traditional safety measures.
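
In spirit, a cognition envelope is a guard placed between the model and the actuator. The sketch below shows one plausible shape for such a guard on a drone waypoint command; the envelope fields, limits, and fallback behavior are hypothetical, not the paper's definitions.

```python
# A bounded-decision wrapper: a VLM-proposed action is executed only if it
# lies inside a hard envelope; otherwise a pre-validated fallback is used.
from dataclasses import dataclass

@dataclass
class Waypoint:
    lat: float
    lon: float
    alt_m: float

@dataclass
class Envelope:
    max_alt_m: float = 120.0
    # (lat_lo, lat_hi, lon_lo, lon_hi) geofence; values are illustrative.
    geofence: tuple[float, float, float, float] = (47.60, 47.70, -122.40, -122.30)

    def admits(self, wp: Waypoint) -> bool:
        lat_lo, lat_hi, lon_lo, lon_hi = self.geofence
        return (0.0 < wp.alt_m <= self.max_alt_m
                and lat_lo <= wp.lat <= lat_hi
                and lon_lo <= wp.lon <= lon_hi)

def bounded_execute(proposed: Waypoint, fallback: Waypoint, env: Envelope) -> Waypoint:
    """Accept the model's waypoint only inside the envelope; a hallucinated
    coordinate falls back to a conservative behavior instead of flying out."""
    return proposed if env.admits(proposed) else fallback

env = Envelope()
hallucinated = Waypoint(lat=48.5, lon=-122.35, alt_m=90.0)   # outside geofence
safe_hover = Waypoint(lat=47.65, lon=-122.35, alt_m=50.0)
print(bounded_execute(hallucinated, safe_hover, env))
```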

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Dual-Modality Multi-Stage Adversarial Safety Training: Robustifying Multimodal Web Agents Against Cross-Modal Attacks

Researchers developed DMAST, a new training framework that protects multimodal web agents from cross-modal attacks where adversaries inject malicious content into webpages to deceive both visual and text processing channels. The method uses adversarial training through a three-stage pipeline and significantly outperforms existing defenses while doubling task completion efficiency.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

TIGeR: Tool-Integrated Geometric Reasoning in Vision-Language Models for Robotics

Researchers have developed TIGeR, a framework that enhances Vision-Language Models with precise geometric reasoning capabilities for robotics applications. The system enables VLMs to execute centimeter-level accurate computations by integrating external computational tools, moving beyond qualitative spatial reasoning to quantitative precision required for real-world robotic manipulation.
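
The general pattern, where the VLM proposes a structured tool call and exact geometry runs outside the model, can be sketched in a few lines. The JSON call format and tool names below are assumptions for illustration, not TIGeR's actual interface.

```python
# The model emits a structured call; the computation is exact, not generated.
import json
import math

def distance_3d(a: list[float], b: list[float]) -> float:
    return math.dist(a, b)

def midpoint_3d(a: list[float], b: list[float]) -> list[float]:
    return [(x + y) / 2 for x, y in zip(a, b)]

TOOLS = {"distance_3d": distance_3d, "midpoint_3d": midpoint_3d}

def dispatch(tool_call_json: str):
    """Parse a VLM-emitted tool call and run the exact computation."""
    call = json.loads(tool_call_json)
    return TOOLS[call["tool"]](*call["args"])

# Asked where to place a gripper between two detected objects, the model
# emits a call like this instead of guessing coordinates in free text:
call = '{"tool": "midpoint_3d", "args": [[0.10, 0.02, 0.31], [0.18, 0.06, 0.29]]}'
print(dispatch(call))   # ~[0.14, 0.04, 0.30], computed rather than generated
```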

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

Learning Physical Principles from Interaction: Self-Evolving Planning via Test-Time Memory

Researchers introduce PhysMem, a memory framework that enables vision-language model robot planners to learn physical principles through real-time interaction without updating model parameters. The system records experiences, generates hypotheses, and verifies them before application, achieving 76% success on brick insertion tasks compared to 23% for direct experience retrieval.
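
A rough sketch of such a record-hypothesize-verify loop, with no parameter updates anywhere, might look like the following; the data structures and the toy hypothesis rule are invented stand-ins for the paper's VLM-driven components.

```python
# Test-time memory: store interaction outcomes, distill a hypothesis, and
# verify it in a trial before letting it steer the planner.
from dataclasses import dataclass, field

@dataclass
class Experience:
    action: str
    outcome: str

@dataclass
class Memory:
    experiences: list[Experience] = field(default_factory=list)
    verified_principles: list[str] = field(default_factory=list)

    def record(self, exp: Experience):
        self.experiences.append(exp)

    def hypothesize(self) -> str | None:
        # Stand-in for a VLM call that summarizes failures into a rule.
        fails = [e for e in self.experiences if e.outcome == "fail"]
        if len(fails) >= 2 and all("tilted" in e.action for e in fails):
            return "insertion fails when the brick is tilted; align before pushing"
        return None

    def verify(self, principle: str, trial_ok: bool):
        # Promote a hypothesis only after a confirming trial; no weight updates.
        if trial_ok:
            self.verified_principles.append(principle)

mem = Memory()
mem.record(Experience("insert brick tilted 15deg", "fail"))
mem.record(Experience("insert brick tilted 10deg", "fail"))
if (h := mem.hypothesize()):
    mem.verify(h, trial_ok=True)   # confirming trial with an aligned brick
print(mem.verified_principles)
```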

AI · Neutral · arXiv – CS AI · Mar 4 · 6/10

ViPlan: A Benchmark for Visual Planning with Symbolic Predicates and Vision-Language Models

Researchers introduce ViPlan, the first benchmark for comparing Vision-Language Model planning approaches, finding that VLM-as-grounder methods excel in visual tasks like Blocksworld while VLM-as-planner methods perform better in household robotics scenarios. The study reveals fundamental limitations in current VLMs' visual reasoning abilities, with Chain-of-Thought prompting showing no consistent benefits.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

Beyond Dominant Patches: Spatial Credit Redistribution For Grounded Vision-Language Models

Researchers introduce Spatial Credit Redistribution (SCR), a training-free method that reduces hallucination in vision-language models by 4.7-6.0 percentage points. The technique redistributes attention from dominant visual patches to contextual areas, addressing the spatial credit collapse problem that causes AI models to generate false objects.
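
The core operation is easy to picture: cap the attention mass on dominant patches and hand the excess to contextual ones. Below is a toy, training-free version; the 0.25 cap and 8-patch input are assumed values, not the paper's.

```python
# Illustrative redistribution of visual attention for one text token.
import numpy as np

def redistribute(attn: np.ndarray, cap: float = 0.25) -> np.ndarray:
    """attn: attention over image patches for one text token (sums to 1)."""
    clipped = np.minimum(attn, cap)
    excess = attn.sum() - clipped.sum()    # mass taken from dominant patches
    context = attn < cap                   # the non-dominant, contextual patches
    # Hand the excess back proportionally, so attention still sums to 1.
    clipped[context] += excess * clipped[context] / clipped[context].sum()
    return clipped

attn = np.array([0.55, 0.20, 0.05, 0.05, 0.05, 0.04, 0.03, 0.03])
out = redistribute(attn)
print(out.round(3), out.sum())   # the 0.55 patch drops to 0.25; total stays 1.0
```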

AI · Neutral · arXiv – CS AI · Apr 7 · 6/10

Discovering Failure Modes in Vision-Language Models using RL

Researchers developed an AI framework using reinforcement learning to automatically discover failure modes in vision-language models without human intervention. The system trains a questioner agent that generates adaptive queries to expose weaknesses, successfully identifying 36 novel failure modes across various VLM combinations.
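
As a caricature of the loop: a questioner is rewarded whenever the target VLM disagrees with a stronger verifier, so high-reward question types surface as failure modes. Everything below (the toy models, templates, and bandit-style update) is illustrative, not the paper's setup.

```python
# Reward the questioner when target and verifier disagree.
import random

random.seed(0)

def target_vlm(query: str) -> str:
    return "3" if "how many" in query else "yes"     # stumbles on counting

def verifier(query: str) -> str:
    return "4" if "how many" in query else "yes"

templates = ["how many {} are in the image?", "is there a {} in the image?"]
objects = ["cups", "dogs", "chairs"]
scores = {t: 0.0 for t in templates}                 # questioner's value estimates

for _ in range(200):
    # epsilon-greedy: mostly exploit the question type with highest reward so far
    t = random.choice(templates) if random.random() < 0.2 else max(scores, key=scores.get)
    q = t.format(random.choice(objects))
    reward = 1.0 if target_vlm(q) != verifier(q) else 0.0
    scores[t] += 0.1 * (reward - scores[t])          # incremental value update

print(max(scores, key=scores.get))                   # surfaces the counting failure
```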

AI · Bearish · arXiv – CS AI · Apr 6 · 6/10

Can VLMs Truly Forget? Benchmarking Training-Free Visual Concept Unlearning

Researchers introduce VLM-UnBench, the first benchmark for evaluating training-free visual concept unlearning in Vision Language Models. The study reveals that realistic prompts fail to genuinely remove sensitive or copyrighted visual concepts, with meaningful suppression only occurring under oracle conditions that explicitly disclose target concepts.

AI · Bullish · arXiv – CS AI · Mar 26 · 6/10

ELITE: Experiential Learning and Intent-Aware Transfer for Self-improving Embodied Agents

Researchers introduce ELITE, a new framework that enables AI embodied agents to learn from their own experiences and transfer knowledge to similar tasks. The system addresses failures in vision-language models when performing complex physical tasks by using self-reflective knowledge construction and intent-aware retrieval mechanisms.

AI · Neutral · arXiv – CS AI · Mar 26 · 6/10

Can VLMs Reason Robustly? A Neuro-Symbolic Investigation

Researchers investigated whether Vision-Language Models (VLMs) can reason robustly under distribution shifts and found that fine-tuned VLMs achieve high accuracy in-distribution but fail to generalize. They propose VLC, a neuro-symbolic method combining VLM-based concept recognition with circuit-based symbolic reasoning that demonstrates consistent performance under covariate shifts.
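
The division of labor can be shown in miniature: a (mocked) VLM scores low-level concepts, and a fixed logic circuit composes them, so the reasoning step itself cannot drift under covariate shift. The concepts and the rule below are invented for illustration.

```python
# Neuro-symbolic split: neural perception, fixed symbolic composition.
def vlm_concept_probs(image) -> dict[str, float]:
    # Stand-in for per-concept recognition scores from a VLM.
    return {"has_wheels": 0.95, "has_wings": 0.04, "on_road": 0.90}

def circuit_is_car(p: dict[str, float]) -> float:
    # Probabilistic AND/NOT gates over concept scores:
    # car := has_wheels AND on_road AND NOT has_wings
    return p["has_wheels"] * p["on_road"] * (1.0 - p["has_wings"])

probs = vlm_concept_probs(image=None)
print(f"P(car) = {circuit_is_car(probs):.3f}")
```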

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

MA-VLCM: A Vision Language Critic Model for Value Estimation of Policies in Multi-Agent Team Settings

Researchers propose MA-VLCM, a framework that uses pretrained vision-language models as centralized critics in multi-agent reinforcement learning instead of learning critics from scratch. This approach significantly improves sample efficiency and enables zero-shot generalization while producing compact policies suitable for resource-constrained robots.
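
One way to picture the idea: Monte-Carlo returns minus a frozen VLM's progress score as the baseline, in place of a learned critic. The vlm_value stub below stands in for an actual model query and is purely illustrative.

```python
# Returns-minus-baseline with a frozen VLM standing in for the critic.
import numpy as np

def vlm_value(team_obs: np.ndarray) -> float:
    # Stand-in for: "VLM, rate progress toward the goal in this frame, 0-1."
    return float(np.clip(team_obs.mean(), 0.0, 1.0))

def advantages(rewards, team_obs_seq, gamma=0.99):
    """Monte-Carlo returns minus the VLM critic's value as the baseline."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return [ret - vlm_value(obs) for ret, obs in zip(returns, team_obs_seq)]

obs_seq = [np.array([0.2, 0.3]), np.array([0.5, 0.6]), np.array([0.9, 0.8])]
print(np.round(advantages([0.0, 0.0, 1.0], obs_seq), 3))
```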

AI · Bullish · arXiv – CS AI · Mar 17 · 6/10

UVLM: A Universal Vision-Language Model Loader for Reproducible Multimodal Benchmarking

Researchers have introduced UVLM (Universal Vision-Language Model Loader), a Google Colab-based framework that provides a unified interface for loading, configuring, and benchmarking multiple Vision-Language Model architectures. The framework currently supports LLaVA-NeXT and Qwen2.5-VL models and enables researchers to compare different VLMs using identical evaluation protocols on custom image analysis tasks.
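
A registry behind a single load function captures the spirit of such a loader. The model ids and the Hugging Face Auto classes below are assumptions about the API surface, not UVLM's code; verify them against your transformers version before relying on this sketch.

```python
# One call, one return shape, regardless of the underlying architecture.
from dataclasses import dataclass
from transformers import AutoProcessor, AutoModelForVision2Seq

@dataclass
class VLMSpec:
    model_id: str
    dtype: str = "bfloat16"

REGISTRY = {
    "llava-next": VLMSpec("llava-hf/llava-v1.6-mistral-7b-hf"),
    "qwen2.5-vl": VLMSpec("Qwen/Qwen2.5-VL-7B-Instruct"),
}

def load_vlm(name: str):
    """Unified entry point: identical protocol for every registered model."""
    spec = REGISTRY[name]
    processor = AutoProcessor.from_pretrained(spec.model_id)
    model = AutoModelForVision2Seq.from_pretrained(spec.model_id, torch_dtype=spec.dtype)
    return processor, model

# processor, model = load_vlm("qwen2.5-vl")   # same call for "llava-next"
```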

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

VLMQ: Token Saliency-Driven Post-Training Quantization for Vision-Language Models

Researchers introduced VLMQ, a post-training quantization framework specifically designed for vision-language models that addresses visual over-representation and modality gaps. The method achieves significant performance improvements, including 16.45% better results on MME-RealWorld under 2-bit quantization compared to existing approaches.
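
The intuition, that a glut of over-represented visual tokens shouldn't dictate the quantization grid, can be demonstrated on synthetic data. The saliency weights and setup below are toy stand-ins, not the paper's method.

```python
# Saliency-weighted calibration: downweight redundant visual tokens when
# estimating per-channel quantization scales, so text tokens keep precision.
import numpy as np

rng = np.random.default_rng(0)

T, D = 64, 16
acts = rng.normal(size=(T, D))
acts[:48] *= 4.0                   # many redundant, high-magnitude visual tokens
saliency = np.ones(T)
saliency[:48] = 0.2                # downweight the over-represented modality

def weighted_scale(x, w, bits=4):
    """Per-channel symmetric scale from a saliency-weighted absolute max."""
    amax = (np.abs(x) * w[:, None]).max(axis=0)
    return amax / (2 ** (bits - 1) - 1)

def quantize(x, scale, bits=4):
    q = np.clip(np.round(x / scale), -(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return q * scale

text = slice(48, None)             # the tokens that carry the question
for name, w in [("uniform", np.ones(T)), ("saliency-weighted", saliency)]:
    err = np.mean((acts[text] - quantize(acts, weighted_scale(acts, w))[text]) ** 2)
    print(f"{name:>18}: text-token MSE {err:.4f}")
```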

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction

Researchers developed a Vision-Language Model capable of estimating 3D object positions from monocular RGB images for human-robot interaction. The model achieved a median accuracy of 13mm and can make acceptable predictions for robot interaction in 25% of cases, representing a five-fold improvement over baseline methods.
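
The interface such a model exposes is simple to sketch: ask for a position in millimetres and parse the reply. The prompt and reply format below are invented for illustration; mock_vlm stands in for the fine-tuned model.

```python
# Prompt-and-parse: recover (x, y, z) in mm from a free-text VLM answer.
import re

def mock_vlm(image_path: str, prompt: str) -> str:
    # Stand-in for the fine-tuned VLM's free-text answer.
    return "The mug is at approximately (312, -45, 508) mm in the camera frame."

def object_position_mm(image_path: str, obj: str) -> tuple[int, int, int]:
    reply = mock_vlm(image_path, f"Where is the {obj}? Answer in mm (x, y, z).")
    m = re.search(r"\((-?\d+),\s*(-?\d+),\s*(-?\d+)\)", reply)
    if m is None:
        raise ValueError(f"unparseable reply: {reply!r}")
    return tuple(int(g) for g in m.groups())

print(object_position_mm("frame.png", "mug"))   # -> (312, -45, 508)
```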

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

MOSAIC: A Unified Platform for Cross-Paradigm Comparison and Evaluation of Homogeneous and Heterogeneous Multi-Agent RL, LLM, VLM, and Human Decision-Makers

MOSAIC is a new open-source platform that enables cross-paradigm comparison and evaluation of different AI agents including reinforcement learning, large language models, vision-language models, and human decision-makers within the same environment. The platform introduces three key technical contributions: an IPC-based worker protocol, operator abstraction for unified interfaces, and a deterministic evaluation framework for reproducible research.
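
The operator abstraction is the easiest contribution to picture: every decision-maker sits behind one interface, so the deterministic evaluation loop never branches on agent type. The Operator protocol below is illustrative, not MOSAIC's actual API.

```python
# One interface for RL policies, language models, and humans alike.
from typing import Protocol
import random

class Operator(Protocol):
    def act(self, observation: str) -> str: ...

class ScriptedRLPolicy:
    def act(self, observation: str) -> str:
        return random.choice(["left", "right"])

class LLMOperator:
    def act(self, observation: str) -> str:
        # Stand-in for a model call; in MOSAIC-like designs this would run
        # in a separate worker process behind an IPC boundary.
        return "left" if "wall on right" in observation else "right"

class HumanOperator:
    def act(self, observation: str) -> str:
        return input(f"{observation} > ")

EPISODES = [("wall on right", "left"),
            ("open corridor", "right"),
            ("wall on right", "left")]

def evaluate(op: Operator, seed: int = 0) -> float:
    """Deterministic harness: same seed and episode stream for every operator."""
    random.seed(seed)
    return sum(op.act(obs) == correct for obs, correct in EPISODES) / len(EPISODES)

print(evaluate(LLMOperator()))   # swap in ScriptedRLPolicy() or HumanOperator()
```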

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

ViTSP: A Vision Language Models Guided Framework for Solving Large-Scale Traveling Salesman Problems

Researchers have developed ViTSP, a framework that uses pre-trained vision language models to solve large-scale Traveling Salesman Problems with average optimality gaps of just 0.24%. The system outperforms existing learning-based methods and narrows optimality gaps by 3.57% to 100% relative to the strong heuristic solver LKH-3 on instances with over 10,000 nodes.
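
The loop is easy to caricature: plot the tour, let a VLM name a tangled region, and run a classical local search only there. Below, the VLM is mocked with a fixed window pick, while the 2-opt repair is real; everything else is illustrative.

```python
# Visualize-then-refine: a (mock) VLM flags a region, 2-opt repairs it.
import math
import random

random.seed(0)
pts = [(random.random(), random.random()) for _ in range(200)]

def tour_length(tour):
    return sum(math.dist(pts[tour[i]], pts[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

def mock_vlm_pick_window(tour):
    # Stand-in for: "which region of this tour plot looks tangled?"
    return 40, 100

def two_opt_window(tour, lo, hi):
    """2-opt restricted to the window the (mock) VLM flagged."""
    best = tour[:]
    for i in range(lo, hi - 1):
        for j in range(i + 2, hi):
            cand = best[:i + 1] + best[i + 1:j + 1][::-1] + best[j + 1:]
            if tour_length(cand) < tour_length(best):
                best = cand
    return best

tour = list(range(len(pts)))                  # naive initial tour
lo, hi = mock_vlm_pick_window(tour)
refined = two_opt_window(tour, lo, hi)
print(f"{tour_length(tour):.2f} -> {tour_length(refined):.2f}")
```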

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs

Researchers introduced SpinBench, a new benchmark for evaluating spatial reasoning abilities in vision language models (VLMs), focusing on perspective taking and viewpoint transformations. Testing 43 state-of-the-art VLMs revealed systematic weaknesses including strong egocentric bias and poor rotational understanding, with human performance significantly outpacing AI models at 91.2% accuracy.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

COMRES-VLM: Coordinated Multi-Robot Exploration and Search using Vision Language Models

Researchers developed COMRES-VLM, a new framework using Vision Language Models to coordinate multiple robots for exploration and object search in indoor environments. The system achieved 10.2% faster exploration and 55.7% higher search efficiency compared to existing methods, while enabling natural language-based human guidance.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10

BEV-VLM: Trajectory Planning via Unified BEV Abstraction

Researchers introduced BEV-VLM, a new autonomous driving trajectory planning system that combines Vision-Language Models with Bird's-Eye View maps from camera and LiDAR data. The approach achieved 53.1% better planning accuracy and complete collision avoidance compared to vision-only methods on the nuScenes dataset.

Page 1 of 2