35 articles tagged with #vlm. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv · CS AI · Apr 7 · 7/10
🧠 Researchers developed Sim2Real-AD, a framework that successfully transfers VLM-guided reinforcement learning policies trained in CARLA simulation to real autonomous vehicles without requiring real-world training data. The system achieved 75-90% success rates in real-world driving scenarios when deployed on a full-scale Ford E-Transit.
AI · Bullish · arXiv · CS AI · Apr 6 · 7/10
🧠 Researchers introduce IMAgent, an open-source visual AI agent trained with reinforcement learning to handle multi-image reasoning tasks. The system addresses limitations of current VLM-based agents that only process single images, using specialized tools for visual reflection and verification to maintain attention on image content throughout inference.
Tags: OpenAI · o1 · o3
AI · Bullish · arXiv · CS AI · Mar 16 · 7/10
🧠 DriveMind introduces a new AI framework combining vision-language models with reinforcement learning for autonomous driving, achieving significant performance improvements in safety and route completion. The system demonstrates strong cross-domain generalization from simulation to real-world dash-cam data, suggesting practical deployment potential.
AI · Bullish · arXiv · CS AI · Mar 12 · 7/10
🧠 Researchers introduce Super Neurons (SNs), a new method that probes raw activations in Vision Language Models to improve classification performance while achieving up to 5.10x speedup. Unlike Sparse Attention Vectors, SNs can identify discriminative neurons in shallow layers, enabling extreme early exiting from the first layer at the first generated token.
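The general idea of selecting class-discriminative neurons in a shallow layer and classifying straight from their activations can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm: the function names, the mean-difference selection criterion, and the nearest-centroid exit rule are all assumptions.

```python
import numpy as np

def select_super_neurons(acts, labels, k=16):
    """Pick the k neurons whose mean activation differs most between two classes.

    acts:   (n_samples, n_neurons) raw activations from a shallow layer.
    labels: (n_samples,) binary class labels (0 or 1).
    """
    mu0 = acts[labels == 0].mean(axis=0)
    mu1 = acts[labels == 1].mean(axis=0)
    return np.argsort(np.abs(mu1 - mu0))[-k:]

def early_exit_predict(act, neuron_idx, mu0, mu1):
    """Classify from a single shallow-layer activation vector by nearest
    class centroid over the selected neurons -- no deeper forward pass."""
    v = act[neuron_idx]
    d0 = np.linalg.norm(v - mu0[neuron_idx])
    d1 = np.linalg.norm(v - mu1[neuron_idx])
    return int(d1 < d0)
```

Because only the selected neurons of one early layer are read, the rest of the network never runs, which is where an early-exit speedup would come from.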
AI · Neutral · arXiv · CS AI · Mar 5 · 6/10
🧠 Researchers introduce 'Cognition Envelopes' as a new framework to constrain AI decision-making in autonomous systems, addressing errors like hallucinations in Large Language Models and Vision-Language Models. The approach is demonstrated through autonomous drone search and rescue missions, establishing reasoning boundaries to complement traditional safety measures.
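In its simplest form, a reasoning boundary of this kind acts as a runtime bounds check that rejects model-proposed actions outside an allowed region. A minimal sketch, assuming a drone action is a dict of named quantities; the field names (`altitude_m`, `speed_mps`) are hypothetical:

```python
def within_envelope(action, envelope):
    """Accept a proposed action only if every bounded field is present
    and lies inside its allowed [lo, hi] range. Actions that fall
    outside the envelope (e.g. from a hallucinating model) are rejected
    before execution."""
    return all(
        field in action and lo <= action[field] <= hi
        for field, (lo, hi) in envelope.items()
    )
```

A supervisor would call this on every candidate action and fall back to a safe behavior (hover, return to base) when it returns `False`.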
AI · Bullish · arXiv · CS AI · Mar 5 · 7/10
🧠 Researchers developed DMAST, a new training framework that protects multimodal web agents from cross-modal attacks where adversaries inject malicious content into webpages to deceive both visual and text processing channels. The method uses adversarial training through a three-stage pipeline and significantly outperforms existing defenses while doubling task completion efficiency.
AI · Bullish · arXiv · CS AI · Mar 5 · 7/10
🧠 Researchers have developed TIGeR, a framework that enhances Vision-Language Models with precise geometric reasoning capabilities for robotics applications. The system enables VLMs to execute centimeter-level accurate computations by integrating external computational tools, moving beyond qualitative spatial reasoning to quantitative precision required for real-world robotic manipulation.
AI · Bullish · arXiv · CS AI · Mar 5 · 6/10
🧠 Researchers introduce PhysMem, a memory framework that enables vision-language model robot planners to learn physical principles through real-time interaction without updating model parameters. The system records experiences, generates hypotheses, and verifies them before application, achieving 76% success on brick insertion tasks compared to 23% for direct experience retrieval.
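The record-hypothesize-verify loop described above can be sketched as a small in-memory store. This is an illustrative simplification, not the paper's implementation: the class name, the majority-vote hypothesis rule, and the single-trial verification are assumptions.

```python
class PhysicalMemory:
    """Store interaction outcomes, form a candidate rule, and only let the
    planner use the rule after a fresh observation has confirmed it."""

    def __init__(self):
        self.experiences = []    # list of (action, outcome) pairs
        self.verified_rules = {} # action -> outcome, confirmed rules only

    def record(self, action, outcome):
        self.experiences.append((action, outcome))

    def hypothesize(self, action):
        # Candidate rule: the most frequent outcome seen for this action.
        outcomes = [o for a, o in self.experiences if a == action]
        return max(set(outcomes), key=outcomes.count) if outcomes else None

    def verify(self, action, observed_outcome):
        # Promote the hypothesis only if a new trial matches it.
        if self.hypothesize(action) == observed_outcome:
            self.verified_rules[action] = observed_outcome

    def plan(self, action):
        # The planner sees only verified rules; None means "no known rule".
        return self.verified_rules.get(action)
```

Note that no model parameters are involved: all learning lives in the memory, matching the training-free framing of the summary.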
AI · Neutral · arXiv · CS AI · Mar 4 · 6/10
🧠 Researchers introduce ViPlan, the first benchmark for comparing Vision-Language Model planning approaches, finding that VLM-as-grounder methods excel in visual tasks like Blocksworld while VLM-as-planner methods perform better in household robotics scenarios. The study reveals fundamental limitations in current VLMs' visual reasoning abilities, with Chain-of-Thought prompting showing no consistent benefits.
AI · Bullish · arXiv · CS AI · Feb 27 · 7/10
🧠 Researchers introduce Spatial Credit Redistribution (SCR), a training-free method that reduces hallucination in vision-language models by 4.7-6.0 percentage points. The technique redistributes attention from dominant visual patches to contextual areas, addressing the spatial credit collapse problem that causes AI models to generate false objects.
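The core redistribution idea can be sketched as capping over-dominant attention weights and spreading the excess over the remaining patches. This is a minimal sketch under assumed details (a fixed per-patch cap, uniform redistribution), not the paper's actual mechanism:

```python
import numpy as np

def redistribute_attention(attn, cap=0.4):
    """Clip attention weights above `cap` and spread the removed mass
    uniformly over the remaining (contextual) patches, so the weights
    still sum to their original total."""
    attn = np.asarray(attn, dtype=float)
    excess = np.clip(attn - cap, 0, None).sum()  # mass taken from dominant patches
    clipped = np.minimum(attn, cap)
    under = clipped < cap
    if under.any():
        clipped[under] += excess / under.sum()   # give it to contextual patches
    return clipped
```

In this toy form a single pass can push a contextual patch slightly above the cap; a fuller treatment would iterate or use a softer reweighting.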
AI · Neutral · arXiv · CS AI · Apr 7 · 6/10
🧠 Researchers developed an AI framework using reinforcement learning to automatically discover failure modes in vision-language models without human intervention. The system trains a questioner agent that generates adaptive queries to expose weaknesses, successfully identifying 36 novel failure modes across various VLM combinations.
AI · Bearish · arXiv · CS AI · Apr 6 · 6/10
🧠 Researchers introduce VLM-UnBench, the first benchmark for evaluating training-free visual concept unlearning in Vision Language Models. The study reveals that realistic prompts fail to genuinely remove sensitive or copyrighted visual concepts, with meaningful suppression only occurring under oracle conditions that explicitly disclose target concepts.
AI · Bullish · arXiv · CS AI · Mar 26 · 6/10
🧠 Researchers introduce ELITE, a new framework that enables AI embodied agents to learn from their own experiences and transfer knowledge to similar tasks. The system addresses failures in vision-language models when performing complex physical tasks by using self-reflective knowledge construction and intent-aware retrieval mechanisms.
AI · Neutral · arXiv · CS AI · Mar 26 · 6/10
🧠 Researchers investigated whether Vision-Language Models (VLMs) can reason robustly under distribution shifts and found that fine-tuned VLMs achieve high accuracy in-distribution but fail to generalize. They propose VLC, a neuro-symbolic method combining VLM-based concept recognition with circuit-based symbolic reasoning that demonstrates consistent performance under covariate shifts.
AI · Bullish · arXiv · CS AI · Mar 17 · 6/10
🧠 Researchers propose MA-VLCM, a framework that uses pretrained vision-language models as centralized critics in multi-agent reinforcement learning instead of learning critics from scratch. This approach significantly improves sample efficiency and enables zero-shot generalization while producing compact policies suitable for resource-constrained robots.
AI · Bullish · arXiv · CS AI · Mar 17 · 6/10
🧠 Researchers have introduced UVLM (Universal Vision-Language Model Loader), a Google Colab-based framework that provides a unified interface for loading, configuring, and benchmarking multiple Vision-Language Model architectures. The framework currently supports LLaVA-NeXT and Qwen2.5-VL models and enables researchers to compare different VLMs using identical evaluation protocols on custom image analysis tasks.
AI · Bullish · arXiv · CS AI · Mar 9 · 6/10
🧠 Researchers introduced VLMQ, a post-training quantization framework specifically designed for vision-language models that addresses visual over-representation and modality gaps. The method achieves significant performance improvements, including 16.45% better results on MME-RealWorld under 2-bit quantization compared to existing approaches.
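For readers unfamiliar with what "2-bit quantization" means concretely: each weight is mapped to one of only four representable values. A minimal sketch of plain symmetric uniform 2-bit quantization (VLMQ itself adds modality-aware machinery on top of ideas like this):

```python
import numpy as np

def quantize_2bit(w):
    """Symmetric uniform 2-bit quantization of a weight tensor.

    The four levels sit at {-1.5, -0.5, 0.5, 1.5} * scale, with the scale
    chosen so the largest-magnitude weight maps to an outermost level.
    Returns (dequantized weights, int8 codes in {-2, -1, 0, 1}).
    """
    scale = np.abs(w).max() / 1.5
    q = np.clip(np.round(w / scale - 0.5), -2, 1)  # integer codes
    return (q + 0.5) * scale, q.astype(np.int8)
```

Storing the codes takes 2 bits per weight plus one scale per tensor; the reconstruction error this introduces is what post-training schemes like VLMQ work to minimize.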
AI · Bullish · arXiv · CS AI · Mar 6 · 6/10
🧠 Researchers propose AoD-IP, a new framework for protecting intellectual property in vision-language models through dynamic authorization and legality-aware assessment. The system allows flexible, user-controlled authorization that can adapt to changing deployment scenarios while preventing unauthorized use of valuable AI models.
AI · Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers propose a training-free paradigm for empowering Vision-Language Models with multi-modal search capabilities through cross-modal model merging. The approach uses Optimal Brain Merging (OBM) to combine text-based search agents with base VLMs without requiring expensive supervised training or reinforcement learning.
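At its simplest, checkpoint merging of this kind combines two sets of matching-shape parameters without any gradient updates. The sketch below shows plain layer-wise linear interpolation; OBM is a more sophisticated, importance-aware scheme, so treat this only as an illustration of the training-free-merging idea, with the function name and `alpha` parameter as assumptions:

```python
import numpy as np

def merge_weights(base, donor, alpha=0.5):
    """Layer-wise linear interpolation between two checkpoints with
    identical parameter names and shapes (e.g. a base VLM and a
    text-based search agent fine-tuned from the same backbone)."""
    return {name: (1 - alpha) * base[name] + alpha * donor[name]
            for name in base}
```

Because merging operates directly on stored weights, it sidesteps the supervised training and RL costs the summary mentions.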
AI · Bullish · arXiv · CS AI · Mar 3 · 6/10
🧠 Researchers developed a Vision-Language Model capable of estimating 3D object positions from monocular RGB images for human-robot interaction. The model achieved a median accuracy of 13mm and can make acceptable predictions for robot interaction in 25% of cases, representing a five-fold improvement over baseline methods.
AI · Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 MOSAIC is a new open-source platform that enables cross-paradigm comparison and evaluation of different AI agents including reinforcement learning, large language models, vision-language models, and human decision-makers within the same environment. The platform introduces three key technical contributions: an IPC-based worker protocol, operator abstraction for unified interfaces, and a deterministic evaluation framework for reproducible research.
AI · Bullish · arXiv · CS AI · Mar 3 · 6/10
🧠 Researchers have developed ViTSP, a framework that uses pre-trained vision language models to solve large-scale Traveling Salesman Problems with average optimality gaps of just 0.24%. The system outperforms existing learning-based methods and reduces gaps by 3.57% to 100% compared to the best heuristic solver LKH-3 on instances with over 10,000 nodes.
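The "optimality gap" metric used to score TSP solvers is worth pinning down: it is the percentage by which a solver's tour exceeds the optimal (or best-known) tour length. A self-contained sketch:

```python
import math

def tour_length(points, tour):
    """Total length of a closed tour over 2-D points, visiting the
    indices in `tour` and returning to the start."""
    return sum(
        math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )

def optimality_gap(length, optimal):
    """Percentage gap of a tour relative to the optimal tour length;
    0.0 means the solver found an optimal tour."""
    return 100.0 * (length - optimal) / optimal
```

On this scale, ViTSP's reported 0.24% average gap means its tours are, on average, only 0.24% longer than the best known.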
AI · Neutral · arXiv · CS AI · Mar 3 · 6/10
🧠 Researchers introduced SpinBench, a new benchmark for evaluating spatial reasoning abilities in vision language models (VLMs), focusing on perspective taking and viewpoint transformations. Testing 43 state-of-the-art VLMs revealed systematic weaknesses including strong egocentric bias and poor rotational understanding, with human performance significantly outpacing AI models at 91.2% accuracy.
AI · Bullish · arXiv · CS AI · Mar 3 · 6/10
🧠 Researchers developed COMRES-VLM, a new framework using Vision Language Models to coordinate multiple robots for exploration and object search in indoor environments. The system achieved 10.2% faster exploration and 55.7% higher search efficiency compared to existing methods, while enabling natural language-based human guidance.
AI · Bullish · arXiv · CS AI · Mar 2 · 6/10
🧠 Researchers introduced BEV-VLM, a new autonomous driving trajectory planning system that combines Vision-Language Models with Bird's-Eye View maps from camera and LiDAR data. The approach achieved 53.1% better planning accuracy and complete collision avoidance compared to vision-only methods on the nuScenes dataset.