y0news

#vision-language-models News & Analysis

132 articles tagged with #vision-language-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Listener-Rewarded Thinking in VLMs for Image Preferences

Researchers introduce a listener-augmented reinforcement learning framework for training vision-language models to better align with human visual preferences. By using an independent frozen model to evaluate and validate reasoning chains, the approach achieves 67.4% accuracy on ImageReward benchmarks and demonstrates significant improvements in out-of-distribution generalization.

🏢 Hugging Face
AI · Bearish · arXiv – CS AI · 6d ago · 7/10

When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't

Researchers introduce the Graded Color Attribution dataset to test whether Vision-Language Models faithfully follow their own stated reasoning rules. The study reveals that VLMs violate their introspective rules in up to 60% of cases, while humans remain consistent. This suggests VLM self-knowledge is fundamentally miscalibrated, with serious implications for high-stakes deployment.

🧠 GPT-5
AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Faithful-First Reasoning, Planning, and Acting for Multimodal LLMs

Researchers propose Faithful-First RPA, a framework that improves multimodal AI reasoning by prioritizing faithfulness to visual evidence. The method uses FaithEvi for supervision and FaithAct for execution, achieving up to 24% improvement in perceptual faithfulness without sacrificing task accuracy.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models

Q-Zoom is a new framework that improves the efficiency of multimodal large language models by intelligently processing high-resolution visual inputs. Using adaptive query-aware perception, the system achieves 2.5-4.4x faster inference speeds on document and high-resolution tasks while maintaining or exceeding baseline accuracy across multiple MLLM architectures.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

SALLIE: Safeguarding Against Latent Language & Image Exploits

Researchers introduce SALLIE, a lightweight runtime defense framework that detects and mitigates jailbreak attacks and prompt injections in large language and vision-language models simultaneously. Using mechanistic interpretability and internal model activations, SALLIE achieves robust protection across multiple architectures without degrading performance or requiring architectural changes.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Asking like Socrates: Socrates helps VLMs understand remote sensing images

Researchers introduce RS-EoT (Remote Sensing Evidence-of-Thought), a novel framework that enables vision-language models to reason more effectively about satellite imagery by iteratively seeking visual evidence rather than relying on linguistic patterns. The approach uses a self-play multi-agent system called SocraticAgent and reinforcement learning to address the 'Glance Effect,' where models superficially analyze large-scale remote sensing images, achieving state-of-the-art performance on multiple benchmarks.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Not All Tokens See Equally: Perception-Grounded Policy Optimization for Large Vision-Language Models

Researchers introduce Perception-Grounded Policy Optimization (PGPO), a novel fine-tuning framework that improves how large vision-language models learn from visual inputs by strategically allocating learning signals to vision-dependent tokens rather than treating all tokens equally. Testing on the Qwen2.5-VL series demonstrates an average 18.7% performance boost across multimodal reasoning benchmarks.

AI · Bearish · arXiv – CS AI · 6d ago · 7/10

Invisible to Humans, Triggered by Agents: Stealthy Jailbreak Attacks on Mobile Vision-Language Agents

Researchers have discovered a new attack vulnerability in mobile vision-language agents where malicious prompts remain invisible to human users but are triggered during autonomous agent interactions. Using an optimization method called HG-IDA*, attackers can achieve 82.5% planning and 75.0% execution hijack rates on GPT-4o by exploiting the lack of touch signals during agent operations, exposing a critical security gap in deployed mobile AI systems.

🧠 GPT-4
AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

Researchers introduce SAVANT, a model-agnostic framework that improves Vision Language Models' ability to detect semantic anomalies in autonomous driving scenarios by 18.5% through structured reasoning instead of ad hoc prompting. The team used this approach to label 10,000 real-world images and fine-tuned an open-source 7B model that achieves 90.8% recall, demonstrating that practical deployment is feasible without depending on proprietary models.

AI · Bullish · arXiv – CS AI · Apr 7 · 7/10

Build on Priors: Vision-Language-Guided Neuro-Symbolic Imitation Learning for Data-Efficient Real-World Robot Manipulation

Researchers have developed a neuro-symbolic framework that enables robots to learn complex manipulation tasks from as few as one demonstration, without requiring manual programming or large datasets. The system uses Vision-Language Models to automatically construct symbolic planning domains and has been validated on real industrial equipment including forklifts and robotic arms.

AI · Neutral · arXiv – CS AI · Apr 7 · 7/10

When Does Multimodal AI Help? Diagnostic Complementarity of Vision-Language Models and CNNs for Spectrum Management in Satellite-Terrestrial Networks

Researchers developed SpectrumQA, a benchmark comparing vision-language models (VLMs) and CNNs for spectrum management in satellite-terrestrial networks. The study reveals task-dependent complementarity: CNNs excel at spatial localization while VLMs uniquely enable semantic reasoning capabilities that CNNs lack entirely.

AI · Neutral · arXiv – CS AI · Mar 27 · 7/10

Sparse Visual Thought Circuits in Vision-Language Models

Research reveals that sparse autoencoder (SAE) features in vision-language models often fail to compose modularly for reasoning tasks. The study finds that combining task-selective feature sets frequently causes output drift and accuracy degradation, challenging assumptions used in AI model steering methods.

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10

GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining

Researchers developed GoldiCLIP, a data-efficient vision-language model that achieves state-of-the-art performance using only 30 million images, 300x less data than leading methods. The framework combines three key innovations (text-conditioned self-distillation, VQA-integrated encoding, and uncertainty-based loss weighting) to significantly improve image-text retrieval.

AI · Bearish · arXiv – CS AI · Mar 27 · 7/10

Shape and Substance: Dual-Layer Side-Channel Attacks on Local Vision-Language Models

Researchers discovered significant privacy vulnerabilities in local Vision-Language Models that use Dynamic High-Resolution preprocessing. The dual-layer attack framework can exploit execution-time variations and cache patterns to infer sensitive information about processed images, even when models run locally for privacy.

AI · Bullish · arXiv – CS AI · Mar 26 · 7/10

SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems

Researchers developed SCoOP, a training-free framework that combines multiple Vision-Language Models to improve uncertainty quantification and reduce hallucinations in AI systems. The method achieves 10-13% better hallucination detection performance compared to existing approaches while adding only microsecond-level overhead to processing time.

AI · Bullish · IEEE Spectrum – AI · Mar 25 · 7/10

Training Driving AI at 50,000× Real Time

General Motors is developing scalable AI systems that can train autonomous driving models at 50,000x real-time speed through high-fidelity simulations. The company combines Vision Language Action models, reinforcement learning, and millions of daily simulations to handle rare 'long-tail' driving scenarios that current systems struggle with.

AI · Bearish · arXiv – CS AI · Mar 17 · 7/10

Narrow Fine-Tuning Erodes Safety Alignment in Vision-Language Agents

Research reveals that fine-tuning aligned vision-language AI models on narrow harmful datasets causes severe safety degradation that generalizes across unrelated tasks. The study shows multimodal models exhibit 70% higher misalignment than text-only evaluation suggests, with even 10% harmful training data causing substantial alignment loss.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Directional Embedding Smoothing for Robust Vision Language Models

Researchers have extended the RESTA defense mechanism to vision-language models (VLMs) to protect against jailbreaking attacks that can cause AI systems to produce harmful outputs. The study found that directional embedding noise significantly reduces attack success rates across the JailBreakV-28K benchmark, providing a lightweight security layer for AI agent systems.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Rationale-Enhanced Decoding for Multi-modal Chain-of-Thought

Researchers have developed rationale-enhanced decoding (RED), a new inference-time strategy that improves chain-of-thought reasoning in large vision-language models. The method addresses the problem where LVLMs ignore generated rationales by harmonizing visual and rationale information during decoding, showing consistent improvements across multiple benchmarks.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Are Video Reasoning Models Ready to Go Outside?

Researchers propose ROVA, a new training framework that improves vision-language models' robustness in real-world conditions, with accuracy gains of up to 24%. The framework addresses performance degradation from weather, occlusion, and camera motion, which can cause accuracy drops of up to 35% in current models.

AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Hybrid Self-evolving Structured Memory for GUI Agents

Researchers developed HyMEM, a brain-inspired hybrid memory system that significantly improves GUI agents' ability to interact with computers. The system uses graph-based structured memory combining symbolic nodes with trajectory embeddings, enabling smaller 7B/8B models to match or exceed performance of larger closed-source models like GPT-4o.

🧠 GPT-4
AI · Bullish · arXiv – CS AI · Mar 12 · 7/10

Taking Shortcuts for Categorical VQA Using Super Neurons

Researchers introduce Super Neurons (SNs), a new method that probes raw activations in Vision Language Models to improve classification performance while achieving up to 5.10x speedup. Unlike Sparse Attention Vectors, SNs can identify discriminative neurons in shallow layers, enabling extreme early exiting from the first layer at the first generated token.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10

BiCLIP: Domain Canonicalization via Structured Geometric Transformation

Researchers introduce BiCLIP, a new framework that improves vision-language models' ability to adapt to specialized domains through geometric transformations. The approach achieves state-of-the-art results across 11 benchmarks while maintaining simplicity and low computational requirements.

AI · Bearish · arXiv – CS AI · Mar 6 · 7/10

Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models

Researchers discovered a new vulnerability in multimodal large language models where specially crafted images can cause significant performance degradation by inducing numerical instability during inference. The attack method was validated on major vision-language models including LLaVA, Idefics3, and SmolVLM, showing substantial performance drops even with minimal image modifications.
