y0news

#image-captioning News & Analysis

5 articles tagged with #image-captioning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Researchers introduce BalCapRL, a reinforcement learning framework that improves multimodal image captioning by balancing three competing objectives: utility-aware correctness, reference coverage, and linguistic quality. The method achieves significant performance gains across multiple models by applying reward-decoupled normalization and length-conditional masking, addressing the trade-offs present in existing captioning approaches.

AI · Bullish · arXiv – CS AI · Apr 10 · 6/10

Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework

Researchers propose a Self-Validation Framework to address object hallucination, in which Large Vision Language Models (LVLMs) describe objects that are not present in the image. The training-free approach validates object existence through language-prior-free verification and achieves a 65.6% improvement on benchmark metrics, suggesting a way to make LVLMs more reliable without additional training.

AI · Bearish · arXiv – CS AI · Mar 3 · 7/10

CaptionFool: Universal Image Captioning Model Attacks

Researchers have developed CaptionFool, a universal adversarial attack that can manipulate AI image captioning models by modifying just 1.2% of image patches. The attack achieves 94-96% success rates in forcing models to generate arbitrary captions, including offensive content that can bypass content moderation systems.

AI · Bullish · arXiv – CS AI · Mar 3 · 5/10

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

Researchers developed Cross-modal Identity Mapping (CIM), a reinforcement learning framework that improves image captioning in Large Vision-Language Models by minimizing information loss during visual-to-text conversion. The method achieved a 20% improvement in relation reasoning on the COCO-LN500 benchmark using Qwen2.5-VL-7B, without requiring additional annotations.

AI · Neutral · Lil'Log (Lilian Weng) · Jun 9 · 4/10

Generalized Visual Language Models

The article discusses generalized visual language models that process images and generate text for tasks such as image captioning and visual question answering. The focus is on extending pre-trained language models to handle visual inputs, rather than on traditional object-detection-based approaches.