y0news

#visual-grounding News & Analysis

16 articles tagged with #visual-grounding. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 27 · 7/10

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding

Researchers introduce LocateAnything, a new vision-language model framework that uses Parallel Box Decoding to detect and localize objects simultaneously rather than sequentially, improving both inference speed and accuracy. The team curated a 138-million-sample dataset and demonstrated significant performance improvements across multiple benchmarks.

AI · Neutral · arXiv – CS AI · Apr 15 · 7/10

Distorted or Fabricated? A Survey on Hallucination in Video LLMs

Researchers have conducted a comprehensive survey of hallucinations in Video Large Language Models (Vid-LLMs), identifying two core types, dynamic distortion and content fabrication, and tracing their root causes to temporal representation limitations and insufficient visual grounding. The study reviews evaluation benchmarks and mitigation strategies, and proposes future directions including motion-aware encoders and counterfactual learning to improve reliability.

AI · Bullish · arXiv – CS AI · Apr 10 · 7/10

Asking like Socrates: Socrates helps VLMs understand remote sensing images

Researchers introduce RS-EoT (Remote Sensing Evidence-of-Thought), a novel framework that enables vision-language models to reason more effectively about satellite imagery by iteratively seeking visual evidence rather than relying on linguistic patterns. The approach uses a self-play multi-agent system called SocraticAgent and reinforcement learning to address the 'Glance Effect,' where models superficially analyze large-scale remote sensing images. It achieves state-of-the-art performance on multiple benchmarks.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10

How Do Medical MLLMs Fail? A Study on Visual Grounding in Medical Images

Researchers found that medical multimodal large language models (MLLMs) fail primarily because of inadequate visual grounding when analyzing medical images, in contrast to their success with natural scenes. They developed the VGMED evaluation dataset and proposed the VGRefine method, which achieves state-of-the-art performance across 6 medical visual question-answering benchmarks without additional training.

AI · Neutral · arXiv – CS AI · 5d ago · 6/10

Mitigating Perceptual Judgment Bias in Multimodal LLM-as-a-Judge via Perceptual Perturbation and Reward Modeling

Researchers identify and address Perceptual Judgment Bias in multimodal large language models used as automated evaluators, where these models favor plausible narratives over visually accurate answers when text and images conflict. The team develops a training framework using perceptually perturbed datasets and reward modeling that improves MLLM judges' visual grounding and evaluation consistency across benchmarks.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

Researchers introduce a structured visual perturbation framework to analyze how Vision-Language-Action (VLA) models ground their autonomous driving decisions in visual information. The study reveals uneven visual dependency across different abstraction levels, highlighting the need for better diagnostic tools to ensure safer, more robust autonomous driving systems.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Researchers propose a unified framework for long-form egocentric video understanding that separates reasoning into semantic and visual evidence streams, achieving competitive results on the HD-EPIC-VQA benchmark. The approach addresses fundamental limitations in how multimodal language models process extended video content by combining procedural structure extraction with fine-grained object grounding.

AI · Bearish · arXiv – CS AI · May 28 · 6/10

Reading or Guessing? Visual Grounding Failures of Vision-Language Models for OCR in Ancient Greek Editions

Researchers demonstrate that Vision-Language Models (VLMs) used for optical character recognition produce fluent but visually unsupported text, relying heavily on language priors rather than actual image content. Testing on Ancient Greek critical editions reveals VLMs generate plausible errors while traditional OCR produces local noise, with token-level grounding analysis showing model-specific vulnerabilities to hallucination.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Advancing Creative Physical Intelligence in Large Multimodal Models

Researchers introduce MM-CreativityBench, a benchmark testing whether large multimodal models can solve creative physical problems by identifying non-obvious tool uses in constrained environments. Current LMMs struggle not from lack of generation capability but from poor visual grounding, hallucinating attributes and overlooking relevant entities; the team proposes affordance-grounded alignment using preference learning to improve performance.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

Researchers introduce SeePhys Pro, a benchmark revealing that advanced AI models significantly degrade in physics reasoning when visual information replaces text, with visual grounding as the primary failure point. The study further demonstrates that multimodal reinforcement learning improvements can stem from non-visual textual cues rather than genuine visual understanding, challenging current evaluation methodologies.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Investigating Anisotropy in Visual Grounding under Controlled Counterfactual Perturbations

Researchers investigate why visual grounding models fail when image captions are semantically mismatched, hypothesizing that embedding anisotropy may be responsible. Testing two transformer-based models with different embedding geometries reveals no meaningful correlation between cosine similarity and approximation errors, suggesting the problem requires investigation of deeper geometric properties.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

Researchers introduce COAST, a novel pruning framework for vision-language models that reduces visual tokens by 77.8% while maintaining 98.64% performance and achieving 2.15x speedup. Unlike existing methods that discard low-attention tokens, COAST uses adaptive semantic routing to preserve contextually essential information, preventing 'Visual Aphasia'—a failure mode where models lose visual grounding.

AI · Bullish · arXiv – CS AI · May 4 · 6/10

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

Researchers propose Persistent Visual Memory (PVM), a lightweight module that addresses visual signal degradation in Large Vision-Language Models by maintaining consistent visual perception during long text generation. Integrated into Qwen3-VL models, PVM demonstrates measurable accuracy improvements with minimal computational overhead, particularly benefiting complex reasoning tasks.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

Flattery in Motion: Benchmarking and Analyzing Sycophancy in Video-LLMs

Researchers introduce VISE, the first benchmark for evaluating sycophancy in video large language models (Video-LLMs), that is, the tendency of models to agree with user inputs that contradict visual evidence. The study proposes two training-free mitigation strategies: enhanced visual grounding through keyframe selection and inference-time neural representation steering, addressing a critical reliability gap in multimodal AI systems.

AI · Bearish · arXiv – CS AI · Apr 7 · 6/10

Don't Blink: Evidence Collapse during Multimodal Reasoning

Research reveals that Vision Language Models (VLMs) progressively lose visual grounding during reasoning tasks, creating dangerous low-entropy predictions that appear confident but lack visual evidence. The study found attention to visual evidence drops by over 50% during reasoning across multiple benchmarks, requiring task-aware monitoring for safe AI deployment.

AI · Neutral · arXiv – CS AI · Mar 4 · 5/10 · 3

See and Remember: A Multimodal Agent for Web Traversal

Researchers developed V-GEMS, a new multimodal AI agent architecture that improves web navigation by combining visual grounding with explicit memory systems. The system achieved a 28.7% performance improvement over existing baselines by preventing navigation loops and enabling better backtracking through structured path mapping.