y0news

#vision-language-models News & Analysis

160 articles tagged with #vision-language-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

Probing Visual Concepts in Lightweight Vision-Language Models for Automated Driving

Researchers analyzed Vision-Language Models (VLMs) used in automated driving to understand why they fail on simple visual tasks. They identified two failure modes: perceptual failure, where visual information is not encoded, and cognitive failure, where the information is present but not properly aligned with language semantics.
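To make the perceptual-versus-cognitive distinction concrete, a minimal linear-probing sketch is shown below. It is not the paper's protocol: the feature extraction, the `visual_feats`/`concept_labels` inputs, and the probe-accuracy threshold are all illustrative assumptions.

```python
# Illustrative linear-probe sketch (not the paper's code). If a simple probe can
# recover a visual concept from frozen VLM features, wrong downstream answers
# point to a cognitive failure (information encoded but misused); if the probe
# also fails, the failure is perceptual (information never encoded).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def classify_failure(visual_feats, concept_labels, vlm_answer_correct, probe_threshold=0.8):
    # Hold out part of the data so the probe is evaluated on unseen samples.
    X_tr, X_te, y_tr, y_te, _, correct_te = train_test_split(
        visual_feats, concept_labels, vlm_answer_correct, test_size=0.3, random_state=0)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    if probe.score(X_te, y_te) < probe_threshold:
        return "perceptual"  # concept is not linearly decodable from the features
    # Concept is decodable, so any remaining wrong answers are cognitive failures.
    return "cognitive" if not np.asarray(correct_te, dtype=bool).all() else "no failure"
```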

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

HiPP-Prune: Hierarchical Preference-Conditioned Structured Pruning for Vision-Language Models

Researchers introduce HiPP-Prune, a new framework for efficiently compressing vision-language models while maintaining performance and reducing hallucinations. The hierarchical approach uses preference-based pruning that considers multiple objectives including task utility, visual grounding, and compression efficiency.
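As a rough illustration of preference-conditioned structured pruning, the sketch below scores prunable units by a weighted mix of task utility, visual grounding, and cost; the scoring inputs, the preference interface, and the keep ratio are assumptions, not HiPP-Prune's actual formulation.

```python
# Minimal sketch of preference-weighted structured pruning (illustrative only).
import torch

def select_heads(task_utility, visual_grounding, cost, prefs, keep_ratio=0.6):
    """All inputs are per-head scores of shape (num_heads,). `prefs` is an assumed
    interface: weights for utility, grounding, and efficiency that condition the
    pruning decision on a user preference."""
    score = (prefs["utility"] * task_utility
             + prefs["grounding"] * visual_grounding
             - prefs["efficiency"] * cost)
    k = max(1, int(keep_ratio * score.numel()))
    keep = torch.topk(score, k).indices
    mask = torch.zeros_like(score, dtype=torch.bool)
    mask[keep] = True
    return mask  # True = keep this head; False = prune it

mask = select_heads(torch.rand(32), torch.rand(32), torch.rand(32),
                    {"utility": 0.5, "grounding": 0.3, "efficiency": 0.2})
```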

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models

Researchers developed DEX-AR, a new explainability method for autoregressive Vision-Language Models that generates 2D heatmaps to understand how these AI systems make decisions. The method addresses challenges in interpreting modern VLMs by analyzing token-by-token generation and visual-textual interactions, showing improved performance across multiple benchmarks.
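A generic way to obtain such per-token heatmaps is to aggregate the attention the newest generated token pays to the image-patch positions, as in the sketch below; this is a plain attention-averaging baseline, not the DEX-AR method, and the tensor layout is assumed.

```python
# Per-token visual heatmap via attention averaging (illustrative baseline).
import torch

def token_heatmap(attentions, patch_slice, grid_hw):
    """attentions: per-layer tensors of shape (batch, heads, q_len, k_len) for the
    current generation step; patch_slice selects the image-patch key positions."""
    per_layer = []
    for layer_attn in attentions:
        a = layer_attn[0, :, -1, patch_slice]  # attention of the newest token to patches
        per_layer.append(a.mean(dim=0))        # average over heads
    heat = torch.stack(per_layer).mean(dim=0)  # average over layers
    return heat.reshape(grid_hw)               # e.g. a (24, 24) patch grid

# heat = token_heatmap(step_attentions, slice(1, 577), (24, 24))
```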

๐Ÿข Perplexity
AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

VLMQ: Token Saliency-Driven Post-Training Quantization for Vision-language Models

Researchers introduced VLMQ, a post-training quantization framework specifically designed for vision-language models that addresses visual over-representation and modality gaps. The method achieves significant performance improvements, including 16.45% better results on MME-RealWorld under 2-bit quantization compared to existing approaches.
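The core idea, sketched very loosely below, is to weight the quantization calibration error by a per-token saliency so that redundant visual tokens do not dominate the objective; the uniform quantizer and the saliency input here are placeholders, not VLMQ's actual solver.

```python
# Saliency-weighted calibration error for post-training quantization (sketch).
import torch

def quantize_layer(W, X, token_saliency, n_bits=2):
    """W: (out, in) weights; X: (tokens, in) calibration activations;
    token_saliency: (tokens,) importance, low for redundant visual tokens."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = W.abs().max() / qmax
    Wq = torch.clamp(torch.round(W / scale), -qmax - 1, qmax) * scale
    # Weighted reconstruction error that a PTQ solver would minimise when
    # choosing scales or rounding decisions.
    err = ((X @ (W - Wq).T) ** 2 * token_saliency[:, None]).mean()
    return Wq, err
```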

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

Think with 3D: Geometric Imagination Grounded Spatial Reasoning from Limited Views

Researchers introduce 3DThinker, a new framework that enables vision-language models to perform 3D spatial reasoning from limited 2D views without requiring 3D training data. The system uses a two-stage training approach to align 3D representations with foundation models and demonstrates superior performance across multiple benchmarks.

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

CASA: Cross-Attention over Self-Attention for Efficient Vision-Language Fusion

Researchers present CASA, a new approach using cross-attention over self-attention for vision-language models that maintains competitive performance while significantly reducing memory and compute costs. The method shows particular advantages for real-time applications like video captioning by avoiding expensive token insertion into language model streams.
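The general pattern is sketched below: text hidden states attend to visual features through a cross-attention block instead of prepending hundreds of visual tokens to the language model's input, so the LM sequence length and KV cache stay unchanged. This is the standard cross-attention fusion recipe, not CASA's exact architecture.

```python
# Cross-attention fusion sketch: the LM stream keeps its original length.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_hidden, visual_feats):
        # Queries come from text tokens; keys/values come from image features.
        fused, _ = self.attn(query=text_hidden, key=visual_feats, value=visual_feats)
        return self.norm(text_hidden + fused)  # residual; no tokens inserted

layer = CrossAttentionFusion()
out = layer(torch.randn(1, 32, 768), torch.randn(1, 576, 768))  # 32 text tokens, 576 patches
```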

AI · Bullish · arXiv – CS AI · Mar 6 · 6/10

Differentially Private Multimodal In-Context Learning

Researchers introduce DP-MTV, the first framework enabling privacy-preserving multimodal in-context learning for vision-language models using differential privacy. The system allows processing hundreds of demonstrations while maintaining formal privacy guarantees, achieving competitive performance on benchmarks like VizWiz with only minimal accuracy loss.
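The underlying recipe for private aggregation over many demonstrations is the clip-and-noise pattern sketched below (a generic Gaussian mechanism over per-example vectors); the quantity being aggregated, the clipping bound, and the noise calibration are illustrative assumptions rather than DP-MTV's actual mechanism.

```python
# Clip each per-demonstration contribution, average, and add Gaussian noise so
# that no single demonstration can be inferred from the released aggregate.
import numpy as np

def dp_mean(per_example_vectors, clip_norm=1.0, noise_multiplier=1.0, seed=0):
    rng = np.random.default_rng(seed)
    clipped = []
    for v in per_example_vectors:
        norm = np.linalg.norm(v)
        clipped.append(v * min(1.0, clip_norm / max(norm, 1e-12)))  # bound each example's influence
    mean = np.mean(clipped, axis=0)
    sigma = noise_multiplier * clip_norm / len(clipped)  # noise scaled to the mean's sensitivity
    return mean + rng.normal(0.0, sigma, size=mean.shape)
```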

AI · Neutral · arXiv – CS AI · Mar 6 · 6/10

Context-Dependent Affordance Computation in Vision-Language Models

Researchers found that vision-language models like Qwen-VL and LLaVA compute object affordances in highly context-dependent ways, with over 90% of scene descriptions changing under contextual priming. The study reveals that these models do not hold a fixed understanding of objects but interpret them dynamically depending on the situational context.
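A simple version of such a priming probe can be sketched as follows; `query_vlm` is a placeholder for whichever VLM interface is available, and the string-similarity measure is only illustrative, not the study's actual protocol.

```python
# Query the same image under two context primes and measure how much the
# affordance description changes (higher score = more context-dependent).
from difflib import SequenceMatcher

PRIMES = ["You are in a kitchen preparing dinner.",
          "You are in a workshop repairing furniture."]

def affordance_shift(image, obj, query_vlm):
    answers = [query_vlm(image, f"{p} What can you do with the {obj}?") for p in PRIMES]
    similarity = SequenceMatcher(None, answers[0], answers[1]).ratio()
    return 1.0 - similarity
```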

AI · Bullish · arXiv – CS AI · Mar 5 · 5/10

GarmentPile++: Affordance-Driven Cluttered Garments Retrieval with Vision-Language Reasoning

Researchers developed GarmentPile++, an AI pipeline that uses vision-language models to retrieve individual garments from cluttered piles following natural language instructions. The system integrates visual affordance perception with dual-arm robotics to handle complex garment manipulation tasks in real-world home assistant applications.

AI · Bullish · arXiv – CS AI · Mar 4 · 5/10

VL-KGE: Vision-Language Models Meet Knowledge Graph Embeddings

Researchers have developed VL-KGE, a new framework that combines Vision-Language Models with Knowledge Graph Embeddings to better process multimodal knowledge graphs. The approach addresses limitations in existing methods by enabling stronger cross-modal alignment and more unified representations across diverse data types.
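Cross-modal alignment of this kind is often implemented by projecting both modalities into a shared space and training with a contrastive objective, as in the hedged sketch below; the projection dimensions and InfoNCE-style loss are generic assumptions, not VL-KGE's actual objective or graph encoder.

```python
# Shared-space alignment of VLM features and KG entity embeddings (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedProjection(nn.Module):
    def __init__(self, d_vlm=768, d_kg=256, d_shared=512):
        super().__init__()
        self.proj_vlm = nn.Linear(d_vlm, d_shared)
        self.proj_kg = nn.Linear(d_kg, d_shared)

    def forward(self, vlm_feats, kg_embeds, temperature=0.07):
        v = F.normalize(self.proj_vlm(vlm_feats), dim=-1)
        g = F.normalize(self.proj_kg(kg_embeds), dim=-1)
        logits = v @ g.T / temperature        # pairwise similarities
        labels = torch.arange(v.shape[0])      # i-th image matches i-th entity
        return F.cross_entropy(logits, labels) # pull matched pairs together
```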

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

SpinBench: Perspective and Rotation as a Lens on Spatial Reasoning in VLMs

Researchers introduced SpinBench, a new benchmark for evaluating spatial reasoning abilities in vision language models (VLMs), focusing on perspective taking and viewpoint transformations. Testing 43 state-of-the-art VLMs revealed systematic weaknesses including strong egocentric bias and poor rotational understanding, with human performance significantly outpacing AI models at 91.2% accuracy.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

COMRES-VLM: Coordinated Multi-Robot Exploration and Search using Vision Language Models

Researchers developed COMRES-VLM, a new framework using Vision Language Models to coordinate multiple robots for exploration and object search in indoor environments. The system achieved 10.2% faster exploration and 55.7% higher search efficiency compared to existing methods, while enabling natural language-based human guidance.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition

Researchers introduce AdaptVision, a new Vision-Language Model that reduces computational overhead by adaptively determining the minimum visual tokens needed per sample. The model uses a coarse-to-fine approach with reinforcement learning to balance accuracy and efficiency, achieving superior performance while consuming fewer visual tokens than existing methods.
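The coarse-to-fine idea can be sketched as an early-exit loop over increasing token budgets; the `encode_image`/`answer` methods and the confidence check below are hypothetical stand-ins for AdaptVision's learned acquisition policy.

```python
# Try a cheap visual encoding first; only pay for more tokens if the model is unsure.
def answer_with_adaptive_tokens(model, image, question,
                                budgets=(64, 144, 576), confidence_threshold=0.85):
    answer = None
    for budget in budgets:
        tokens = model.encode_image(image, num_tokens=budget)  # assumed API
        answer, confidence = model.answer(question, tokens)     # assumed API
        if confidence >= confidence_threshold:
            return answer, budget  # early exit: fewer visual tokens consumed
    return answer, budgets[-1]     # fall back to the largest budget
```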

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Efficient Long-Horizon GUI Agents via Training-Free KV Cache Compression

Researchers developed ST-Lite, a training-free KV cache compression framework that accelerates GUI agents by 2.45x while using only 10-20% of the cache budget. The solution addresses memory and latency constraints in Vision-Language Models for autonomous GUI interactions through specialized attention pattern optimization.
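In its generic form, training-free KV cache compression keeps only the positions that have accumulated the most attention, as sketched below; ST-Lite's GUI-specific attention-pattern analysis and budget allocation are not reproduced here.

```python
# Heavy-hitter KV cache pruning: keep the most-attended positions per head.
import torch

def compress_kv(keys, values, attn_weights, budget_ratio=0.15):
    """keys/values: (heads, seq, dim); attn_weights: (heads, q_len, seq)."""
    importance = attn_weights.sum(dim=1)             # attention mass per position
    k = max(1, int(budget_ratio * keys.shape[1]))
    idx = importance.topk(k, dim=-1).indices         # (heads, k) positions to keep
    idx = idx.unsqueeze(-1).expand(-1, -1, keys.shape[-1])
    return keys.gather(1, idx), values.gather(1, idx)
```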

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Stateful Token Reduction for Long-Video Hybrid VLMs

Researchers developed a new token reduction method for hybrid vision-language models that process long videos, achieving 3.8-4.2x speedup while retaining only 25% of visual tokens. The approach uses progressive reduction and unified scoring for both attention and Mamba blocks, maintaining near-baseline accuracy on long-context video benchmarks.
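Progressive reduction of this sort can be sketched as repeatedly scoring and keeping a shrinking subset of visual tokens between blocks; the `score_fn` below is a placeholder for the paper's unified attention/Mamba scoring, and the schedule values are illustrative.

```python
# Progressive visual-token reduction between blocks (generic sketch).
import torch

def progressive_reduce(visual_tokens, blocks, keep_schedule, score_fn):
    """keep_schedule: fraction of the ORIGINAL token count kept after each stage,
    e.g. (1.0, 0.6, 0.4, 0.25); score_fn maps (batch, seq, dim) -> (batch, seq)."""
    n_orig = visual_tokens.shape[1]
    x = visual_tokens
    for block, keep in zip(blocks, keep_schedule):
        x = block(x)
        k = min(x.shape[1], max(1, int(keep * n_orig)))
        idx = score_fn(x).topk(k, dim=-1).indices  # most important tokens
        x = x.gather(1, idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))
    return x  # roughly 25% of the original visual tokens reach the final blocks
```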

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Monocular 3D Object Position Estimation with VLMs for Human-Robot Interaction

Researchers developed a Vision-Language Model capable of estimating 3D object positions from monocular RGB images for human-robot interaction. The model achieved a median accuracy of 13mm and can make acceptable predictions for robot interaction in 25% of cases, representing a five-fold improvement over baseline methods.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

MVR: Multi-view Video Reward Shaping for Reinforcement Learning

Researchers introduce Multi-View Video Reward Shaping (MVR), a new reinforcement learning framework that uses multi-viewpoint video analysis and vision-language models to improve reward design for complex AI tasks. The system addresses limitations of single-image approaches by analyzing dynamic motions across multiple camera angles, showing improved performance on humanoid locomotion and manipulation tasks.

AI · Neutral · arXiv – CS AI · Mar 3 · 6/10

OmniSpatial: Towards Comprehensive Spatial Reasoning Benchmark for Vision Language Models

Researchers introduce OmniSpatial, a comprehensive benchmark for testing spatial reasoning capabilities in vision-language models (VLMs). The benchmark reveals significant limitations in both open and closed-source VLMs across four major spatial reasoning categories, with over 8,400 question-answer pairs testing advanced cognitive abilities.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10

DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation

Researchers introduce DesignSense-10k, a dataset of 10,235 human-annotated preference pairs for evaluating graphic layout generation, along with DesignSense, a specialized AI model that outperforms existing models by 54.6% in layout quality assessment. The framework addresses the gap between AI-generated layouts and human aesthetic preferences, showing practical improvements in layout generation through reinforcement learning.
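Preference pairs of this kind are typically consumed by a pairwise (Bradley-Terry) reward-modeling objective, sketched below; the reward model's inputs and architecture are placeholders, not the DesignSense model itself.

```python
# Pairwise reward-modeling loss: push r(preferred layout) above r(rejected layout).
import torch.nn.functional as F

def preference_loss(reward_model, preferred_layout, rejected_layout):
    r_pos = reward_model(preferred_layout)  # scalar score per layout
    r_neg = reward_model(rejected_layout)
    return -F.logsigmoid(r_pos - r_neg).mean()
```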

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10

3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection

Researchers developed MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language models for 3D MRI multi-organ abnormality detection. The framework addresses challenges in modality-specific alignment and cross-modal feature fusion, demonstrating superior performance on a curated dataset of 7,392 3D MRI volume-report pairs.