#image-captioning News & Analysis

6 articles tagged with #image-captioning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

6 articles

AINeutralarXiv – CS AI · Jun 236/10

🧠

MIRCaps: A Large-Scale Mixed-Domain Dataset with Image-Level and Region-Level Captions for Fine-Grained Vision-Language Learning

Researchers introduce MIRCaps, a large-scale multimodal dataset containing 141,364 images with 981,947 image-level and 1,742,264 region-level captions designed to improve Vision-Language Models (VLMs) for general imagery and CCTV surveillance applications. The dataset demonstrates effective fine-tuning of lightweight VLMs across image captioning and object detection tasks, with code and data publicly available.

AIBullisharXiv – CS AI · May 116/10

🧠

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Researchers introduce BalCapRL, a reinforcement learning framework that improves multimodal image captioning by balancing three competing objectives: utility-aware correctness, reference coverage, and linguistic quality. The method achieves significant performance gains across multiple models by applying reward-decoupled normalization and length-conditional masking, addressing the trade-offs present in existing captioning approaches.

AIBullisharXiv – CS AI · Apr 106/10

🧠

Countering the Over-Reliance Trap: Mitigating Object Hallucination for LVLMs via a Self-Validation Framework

Researchers propose a Self-Validation Framework to address object hallucination in Large Vision Language Models (LVLMs), where models generate descriptions of non-existent objects in images. The training-free approach validates object existence through language-prior-free verification and achieves 65.6% improvement on benchmark metrics, suggesting a novel path to enhance LVLM reliability without additional training.

AIBearisharXiv – CS AI · Mar 37/107

🧠

CaptionFool: Universal Image Captioning Model Attacks

Researchers have developed CaptionFool, a universal adversarial attack that can manipulate AI image captioning models by modifying just 1.2% of image patches. The attack achieves 94-96% success rates in forcing models to generate arbitrary captions, including offensive content that can bypass content moderation systems.

AIBullisharXiv – CS AI · Mar 35/105

🧠

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

Researchers developed Cross-modal Identity Mapping (CIM), a reinforcement learning framework that improves image captioning in Large Vision-Language Models by minimizing information loss during visual-to-text conversion. The method achieved 20% improvement in relation reasoning on the COCO-LN500 benchmark using Qwen2.5-VL-7B without requiring additional annotations.

AINeutralLil'Log (Lilian Weng) · Jun 94/10

🧠

Generalized Visual Language Models

The article discusses generalized visual language models that can process images to generate text for tasks like image captioning and visual question-answering. The focus is specifically on extending pre-trained language models to handle visual inputs, rather than traditional object detection-based approaches.