y0news

#image-captioning News & Analysis

4 articles tagged with #image-captioning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework

Researchers propose a Self-Validation Framework to address object hallucination in Large Vision-Language Models (LVLMs), in which a model describes objects that do not appear in the image. The training-free approach validates object existence through language-prior-free verification and achieves a 65.6% improvement on benchmark metrics, suggesting a path to more reliable LVLMs without additional training.

AI · Bearish · arXiv – CS AI · Mar 3 · 7/10

CaptionFool: Universal Image Captioning Model Attacks

Researchers have developed CaptionFool, a universal adversarial attack that manipulates AI image captioning models by modifying just 1.2% of image patches. The attack achieves 94-96% success rates in forcing models to generate arbitrary captions, including offensive content that can bypass content moderation systems.

AI · Bullish · arXiv – CS AI · Mar 3 · 5/10

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

Researchers developed Cross-modal Identity Mapping (CIM), a reinforcement learning framework that improves image captioning in Large Vision-Language Models by minimizing information loss during visual-to-text conversion. The method achieved a 20% improvement in relation reasoning on the COCO-LN500 benchmark using Qwen2.5-VL-7B, without requiring additional annotations.

AI · Neutral · Lil'Log (Lilian Weng) · Jun 9 · 4/10

Generalized Visual Language Models

The article discusses generalized visual language models that process images to generate text for tasks such as image captioning and visual question answering. It focuses on extending pre-trained language models to handle visual inputs, rather than on traditional approaches based on object detection.