y0news
#vision-language-models · 9 articles
AI · Bullish · arXiv – CS AI · 6h ago · 6
🧠

DesignSense: A Human Preference Dataset and Reward Modeling Framework for Graphic Layout Generation

Researchers introduce DesignSense-10k, a dataset of 10,235 human-annotated preference pairs for evaluating graphic layout generation, along with DesignSense, a specialized AI model that outperforms existing models by 54.6% in layout quality assessment. The framework addresses the gap between AI-generated layouts and human aesthetic preferences, showing practical improvements in layout generation through reinforcement learning.
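
As a rough illustration of pairwise preference reward modeling (the general recipe behind datasets like DesignSense-10k), a Bradley-Terry-style training step might look like the sketch below. The architecture, feature dimensions, and names are assumptions, not the paper's actual model.

```python
# Hypothetical sketch of pairwise preference reward modeling.
# All names and shapes are assumptions, not DesignSense internals.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayoutRewardModel(nn.Module):
    """Scores a flattened layout feature vector with a scalar reward."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, layout_feats: torch.Tensor) -> torch.Tensor:
        return self.head(layout_feats).squeeze(-1)  # (batch,)

model = LayoutRewardModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One training step on a batch of annotated preference pairs:
# push the preferred layout's reward above the rejected one's.
preferred, rejected = torch.randn(32, 256), torch.randn(32, 256)
loss = -F.logsigmoid(model(preferred) - model(rejected)).mean()
opt.zero_grad(); loss.backward(); opt.step()
```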

AI · Bullish · arXiv – CS AI · 6h ago · 5
🧠

3D Modality-Aware Pre-training for Vision-Language Model in MRI Multi-organ Abnormality Detection

Researchers developed MedMAP, a Medical Modality-Aware Pretraining framework that enhances vision-language models for 3D MRI multi-organ abnormality detection. The framework addresses challenges in modality-specific alignment and cross-modal feature fusion, demonstrating superior performance on a curated dataset of 7,392 3D MRI volume-report pairs.
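
The kind of vision-language alignment objective such pretraining typically builds on is a CLIP-style contrastive loss over volume/report embedding pairs. A minimal sketch, with stubbed encoders and assumed names:

```python
# CLIP-style contrastive alignment for 3D volume / report embeddings,
# illustrating the general objective MedMAP-like frameworks build on.
# Encoder outputs are stubbed with random tensors.
import torch
import torch.nn.functional as F

def contrastive_loss(vol_emb, txt_emb, temperature=0.07):
    vol_emb = F.normalize(vol_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = vol_emb @ txt_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(len(logits))            # matched pairs on diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Stand-ins for a 3D MRI encoder and a report encoder.
volume_embeddings = torch.randn(16, 512)
report_embeddings = torch.randn(16, 512)
print(contrastive_loss(volume_embeddings, report_embeddings))
```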

AI · Bullish · arXiv – CS AI · 6h ago · 2
🧠

See, Act, Adapt: Active Perception for Unsupervised Cross-Domain Visual Adaptation via Personalized VLM-Guided Agent

Researchers introduce Sea² (See, Act, Adapt), a novel approach that improves AI perception models in new environments by using an intelligent pose-control agent rather than retraining the models themselves. The method keeps perception modules frozen and uses a vision-language model as a controller, achieving significant performance improvements of 13-27% across visual tasks without requiring additional training data.
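
A hedged sketch of the frozen-perception control loop described above, with all components reduced to placeholders (the real method uses a VLM as the controller; here a trivial rule stands in):

```python
# The detector is never updated; a controller (standing in for a VLM)
# proposes camera-pose adjustments until the detector is confident.
# All functions here are hypothetical placeholders.
def frozen_detector(view):
    """Pretend perception module: returns (prediction, confidence)."""
    return "object", min(1.0, 0.3 + 0.2 * view["zoom"])

def vlm_controller(view, confidence):
    """Stand-in for the VLM policy: pick a pose action given the scene."""
    return "zoom_in" if confidence < 0.8 else "stop"

view = {"zoom": 1.0}
for step in range(10):
    pred, conf = frozen_detector(view)   # perception stays frozen
    action = vlm_controller(view, conf)  # only the viewpoint adapts
    if action == "stop":
        break
    view["zoom"] += 1.0                  # apply the pose action
print(step, pred, conf)
```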

AI · Bullish · arXiv – CS AI · 6h ago · 4
🧠

Interpretable Debiasing of Vision-Language Models for Social Fairness

Researchers have developed DeBiasLens, a new framework that uses sparse autoencoders to identify and deactivate social bias neurons in Vision-Language models without degrading their performance. The model-agnostic approach addresses concerns about unintended social bias in VLMs by making the debiasing process interpretable and targeting internal model dynamics rather than surface-level fixes.
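
To make the mechanism concrete, here is a toy version of SAE-based feature deactivation: encode a hidden activation into sparse features, zero the features flagged as bias-linked, and decode back. Weights and feature indices are invented placeholders, not DeBiasLens internals.

```python
# Toy sparse-autoencoder ablation: zero flagged features, reconstruct.
import torch
import torch.nn.functional as F

d_model, d_sae = 768, 4096
W_enc = torch.randn(d_model, d_sae) * 0.02
W_dec = torch.randn(d_sae, d_model) * 0.02
bias_feature_ids = torch.tensor([17, 402, 3311])  # assumed, found by probing

def debias(hidden: torch.Tensor) -> torch.Tensor:
    feats = F.relu(hidden @ W_enc)       # sparse feature activations
    feats[:, bias_feature_ids] = 0.0     # deactivate bias-linked features
    return feats @ W_dec                 # reconstruct the activation

patched = debias(torch.randn(4, d_model))
```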

AI · Bullish · arXiv – CS AI · 6h ago · 9
🧠

Quant Experts: Token-aware Adaptive Error Reconstruction with Mixture of Experts for Large Vision-Language Models Quantization

Researchers introduce Quant Experts (QE), a new post-training quantization technique for Vision-Language Models that uses adaptive error compensation with mixture-of-experts architecture. The method addresses computational and memory overhead issues by intelligently handling token-dependent and token-independent channels, maintaining performance comparable to full-precision models across 2B to 70B parameter scales.
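
As background, a toy quantize-then-compensate step: per-channel symmetric weight quantization, with the residual error exposed for a compensation module to model. The expert routing itself is omitted, and every name here is an assumption.

```python
# Per-channel symmetric quantization with the residual error exposed;
# in QE-style methods a compensation module would model this error.
import torch

def quantize_per_channel(W: torch.Tensor, bits: int = 4):
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax   # one scale per row
    Wq = torch.clamp((W / scale).round(), -qmax, qmax)
    return Wq * scale                                  # dequantized weights

W = torch.randn(256, 256)
W_hat = quantize_per_channel(W)
error = W - W_hat   # reconstruction error an "expert" would compensate
print(error.abs().mean())
```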

AI · Bullish · arXiv – CS AI · 6h ago · 5
🧠

Toward Guarantees for Clinical Reasoning in Vision Language Models via Formal Verification

Researchers developed a neurosymbolic verification framework to audit logical consistency in AI-generated radiology reports, addressing issues where vision-language models produce diagnostic conclusions unsupported by their findings. The system uses formal verification methods to identify hallucinations and missing logical conclusions in medical AI outputs, improving diagnostic accuracy.
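
In miniature, such an audit can be framed as entailment checking: each diagnostic conclusion must follow from a rule whose premise findings all appear in the report. The rules and terms below are invented for illustration, not the paper's knowledge base.

```python
# Minimal propositional consistency check: a conclusion is supported only
# if some rule's premise findings are all present in the report.
RULES = {
    "pneumonia": {"consolidation", "fever_noted"},
    "effusion":  {"blunted_costophrenic_angle"},
}

def audit(findings: set[str], conclusions: set[str]) -> dict[str, bool]:
    """True = supported by findings; False = possible hallucination."""
    return {c: c in RULES and RULES[c] <= findings for c in conclusions}

report_findings = {"consolidation", "fever_noted"}
report_conclusions = {"pneumonia", "effusion"}
print(audit(report_findings, report_conclusions))
# {'pneumonia': True, 'effusion': False} -> 'effusion' is unsupported
```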

AI · Bullish · arXiv – CS AI · 6h ago · 12
🧠

Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

Researchers developed Speculative Verdict (SV), a training-free framework that improves large Vision-Language Models' ability to reason over information-dense images by combining multiple small draft models with a larger verdict model. The approach achieves better accuracy on visual question answering benchmarks while reducing computational costs compared to large proprietary models.
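
The draft-then-verdict pattern, schematically: several small models propose answers, and a stronger model issues the final verdict. Model calls are stubbed below; none of these function names come from the paper.

```python
# Schematic of the draft-then-verdict pattern; all calls are stubs.
def small_draft(model_id: int, image, question: str) -> str:
    """Stand-in for a small draft VLM producing an answer with reasoning."""
    return f"draft answer from model {model_id}"

def verdict_model(question: str, drafts: list[str]) -> str:
    """Stand-in for the large model that reads all drafts and decides;
    here it trivially returns the first draft."""
    return drafts[0]

def speculative_verdict(image, question: str, n_drafts: int = 3) -> str:
    drafts = [small_draft(i, image, question) for i in range(n_drafts)]
    return verdict_model(question, drafts)

print(speculative_verdict(image=None, question="What is the total in row 4?"))
```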

AI · Bearish · arXiv – CS AI · 6h ago · 6
🧠

FRIEDA: Benchmarking Multi-Step Cartographic Reasoning in Vision-Language Models

Researchers introduce FRIEDA, a new benchmark for testing cartographic reasoning in large vision-language models. On complex map-interpretation tasks requiring multi-step spatial reasoning, the best models reach only 37-38% accuracy versus 84.87% for humans, revealing significant limitations.