160 articles tagged with #vision-language-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · 1d ago · 7/10
🧠Researchers introduce CropVLM, a reinforcement learning-based method that enables Vision-Language Models to dynamically focus on relevant image regions for improved fine-grained understanding tasks. The approach works with existing VLMs without modification and demonstrates significant performance gains on text recognition and document analysis without requiring human-labeled training data.
AI · Neutral · arXiv – CS AI · 1d ago · 7/10
🧠Researchers introduce VLM-DeflectionBench, a new benchmark with 2,775 samples designed to evaluate how large vision-language models handle conflicting or insufficient evidence. The study reveals that most state-of-the-art LVLMs fail to appropriately deflect when faced with noisy or misleading information, highlighting critical gaps in model reliability for knowledge-intensive tasks.
AI · Bullish · arXiv – CS AI · 1d ago · 7/10
🧠Researchers introduce Ariadne, a framework demonstrating that Reinforcement Learning with Verifiable Rewards (RLVR) expands spatial reasoning capabilities in Vision-Language Models beyond their base distribution. Testing on synthetic mazes and real-world navigation benchmarks shows the technique enables models to solve previously unsolvable problems, suggesting genuine capability expansion rather than merely improved sampling efficiency.
AI · Bearish · arXiv – CS AI · 1d ago · 7/10
🧠Researchers introduce MemJack, a multi-agent framework that exploits semantic vulnerabilities in Vision-Language Models through coordinated jailbreak attacks, achieving a 71.48% attack success rate against Qwen3-VL-Plus. The study reveals that current VLM safety measures fail against sophisticated visual-semantic attacks and introduces MemJack-Bench, a dataset of 113,000+ attack trajectories to advance defensive research.
AI · Bearish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers examine how large vision-language models (LVLMs) exhibit cultural stereotypes and biases when judging people's moral, ethical, and political values from cultural context cues in images. Using counterfactual image sets and Moral Foundations Theory, the analysis of five popular LVLMs raises significant AI fairness concerns beyond traditional social biases, with implications for AI systems deployed globally.
AI · Bearish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers evaluated domain-specific fine-tuning of vision-language models (VLMs) on medical imaging tasks and found that performance degrades significantly with task complexity, with medical fine-tuning providing no consistent advantage. The study reveals that these models exhibit fragility and high sensitivity to prompt variations, questioning the reliability of VLMs for high-stakes medical applications.
🧠 GPT-5
AI · Bearish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers demonstrate that logits and other accessible outputs of vision-language models leak significant task-irrelevant information, creating security risks from unintentional or malicious information exposure despite apparent safeguards.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers propose Risk Awareness Injection (RAI), a lightweight, training-free framework that enhances vision-language models' ability to recognize unsafe content by amplifying risk signals in their feature space. The method maintains model utility while significantly reducing vulnerability to multimodal jailbreak attacks, addressing a critical security gap in VLMs.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers demonstrate that variational Bayesian methods significantly improve Vision-Language Models' reliability for Visual Question Answering tasks by enabling selective prediction with reduced hallucinations and overconfidence. The proposed Variational VQA approach shows particular strength at low error tolerances and offers a practical path to making large multimodal models safer without proportional computational costs.
AI · Bearish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce Grid2Matrix, a benchmark that reveals fundamental limitations in Vision-Language Models' ability to accurately process and describe visual details in grids. The study identifies a critical gap called 'Digital Agnosia', in which visual encoders preserve grid information that fails to translate into accurate language outputs, suggesting that VLM failures stem not from poor vision encoding but from a disconnect between visual features and linguistic expression.
AI · Bearish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce HAERAE-Vision, a benchmark of 653 real-world underspecified visual questions from Korean online communities, revealing that state-of-the-art vision-language models achieve under 50% accuracy on natural queries despite performing well on structured benchmarks. The study demonstrates that query clarification alone improves performance by 8-22 points, highlighting a critical gap between current evaluation standards and real-world deployment requirements.
🧠 GPT-5 · 🧠 Gemini
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠SVD-Prune introduces a training-free token pruning method for Vision-Language Models using Singular Value Decomposition to reduce computational overhead. The approach maintains model performance while drastically reducing vision tokens to 16-32, addressing efficiency challenges in multimodal AI systems without requiring retraining.
AI · Bearish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers present Edu-MMBias, a comprehensive framework for detecting social biases in Vision-Language Models used in educational settings. The study reveals that VLMs exhibit compensatory class bias while harboring persistent health and racial stereotypes, and critically, that visual inputs bypass text-based safety mechanisms to trigger hidden biases.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce TARAC, a training-free framework that mitigates hallucinations in Large Vision-Language Models by dynamically preserving visual attention across generation steps. The method achieves significant improvements—reducing hallucinated content by 25.2% and boosting perception scores by 10.65—while adding only ~4% computational overhead, making it practical for real-world deployment.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers introduce a listener-augmented reinforcement learning framework for training vision-language models to better align with human visual preferences. By using an independent frozen model to evaluate and validate reasoning chains, the approach achieves 67.4% accuracy on ImageReward benchmarks and demonstrates significant improvements in out-of-distribution generalization.
🏢 Hugging Face
AI · Bullish · arXiv – CS AI · 6d ago · 7/10
🧠Q-Zoom is a new framework that improves the efficiency of multimodal large language models by intelligently processing high-resolution visual inputs. Using adaptive query-aware perception, the system achieves 2.5-4.4x faster inference on document and high-resolution tasks while maintaining or exceeding baseline accuracy across multiple MLLM architectures.
AI · Bearish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers introduce the Graded Color Attribution dataset to test whether Vision-Language Models faithfully follow their own stated reasoning rules. The study reveals that VLMs systematically violate their introspective rules in up to 60% of cases, while humans remain consistent, suggesting VLM self-knowledge is fundamentally miscalibrated with serious implications for high-stakes deployment.
🧠 GPT-5
AI · Bullish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers propose Faithful-First RPA, a framework that improves multimodal AI reasoning by prioritizing faithfulness to visual evidence. The method uses FaithEvi for supervision and FaithAct for execution, achieving up to 24% improvement in perceptual faithfulness without sacrificing task accuracy.
AI · Bullish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers introduce Perception-Grounded Policy Optimization (PGPO), a novel fine-tuning framework that improves how large vision-language models learn from visual inputs by strategically allocating learning signals to vision-dependent tokens rather than treating all tokens equally. Testing on the Qwen2.5-VL series demonstrates an average 18.7% performance boost across multimodal reasoning benchmarks.
AI · Bullish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers introduce SAVANT, a model-agnostic framework that improves Vision-Language Models' ability to detect semantic anomalies in autonomous driving scenarios by 18.5% through structured reasoning instead of ad hoc prompting. The team used this approach to label 10,000 real-world images and fine-tuned an open-source 7B model that achieves 90.8% recall, demonstrating practical deployment feasibility without proprietary model dependency.
AI · Bullish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers introduce RS-EoT (Remote Sensing Evidence-of-Thought), a novel framework that enables vision-language models to reason more effectively about satellite imagery by iteratively seeking visual evidence rather than relying on linguistic patterns. The approach uses a self-play multi-agent system called SocraticAgent and reinforcement learning to address the 'Glance Effect,' in which models only superficially analyze large-scale remote sensing images, and achieves state-of-the-art performance on multiple benchmarks.
AI · Bullish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers introduce SALLIE, a lightweight runtime defense framework that detects and mitigates jailbreak attacks and prompt injections in large language and vision-language models simultaneously. Using mechanistic interpretability and internal model activations, SALLIE achieves robust protection across multiple architectures without degrading performance or requiring architectural changes.
AI · Bearish · arXiv – CS AI · 6d ago · 7/10
🧠Researchers have discovered a new attack vulnerability in mobile vision-language agents where malicious prompts remain invisible to human users but are triggered during autonomous agent interactions. Using an optimization method called HG-IDA*, attackers can achieve 82.5% planning and 75.0% execution hijack rates on GPT-4o by exploiting the lack of touch signals during agent operations, exposing a critical security gap in deployed mobile AI systems.
🧠 GPT-4
AI · Neutral · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers developed SpectrumQA, a benchmark comparing vision-language models (VLMs) and CNNs for spectrum management in satellite-terrestrial networks. The study reveals task-dependent complementarity: CNNs excel at spatial localization while VLMs uniquely enable semantic reasoning capabilities that CNNs lack entirely.
AI · Bullish · arXiv – CS AI · Apr 7 · 7/10
🧠Researchers have developed a neuro-symbolic framework that enables robots to learn complex manipulation tasks from as few as one demonstration, without requiring manual programming or large datasets. The system uses Vision-Language Models to automatically construct symbolic planning domains and has been validated on real industrial equipment including forklifts and robotic arms.