SAGE: An Expert-Annotated South Asian GI Endoscopy Dataset for Multimodal Learning and Hallucination Analysis
Researchers introduce SAGE, a South Asian GI endoscopy dataset with 1,300 expert-annotated images designed to address geographic bias in medical AI models. Benchmarking reveals existing AI models suffer significant performance degradation on South Asian data, with task-specific classifiers dropping 58% in accuracy and multimodal models showing substantial accuracy losses in clinical detection tasks.
The SAGE dataset addresses a critical blind spot in medical AI development: the absence of diverse geographic representation in training data. While AI-assisted diagnosis shows tremendous potential for resource-limited healthcare settings, the field has built diagnostic systems almost exclusively on European datasets, so the resulting tools are tuned to populations that differ from many of those they are meant to serve. This research exposes how severely geographic bias affects model reliability by benchmarking existing models on South Asian data.
The performance gaps documented are substantial and clinically significant. A 58% accuracy drop in multi-class classification represents the difference between a useful diagnostic aid and a potentially dangerous tool. For anatomical landmark detection in large multimodal models, GREEN scores fell to 0.308, far below clinical utility thresholds. These results suggest that models showing strong performance on Western datasets may fail precisely where they're needed most: in underserved regions with limited specialist availability.
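To make the idea of an accuracy drop under geographic shift concrete, here is a minimal sketch that compares a classifier's accuracy on an in-distribution test set with its accuracy on a shifted one. The labels and predictions are made-up toy values, not SAGE results, and the function names are hypothetical. The sketch reports both the absolute drop (in accuracy points) and the relative drop, because this summary does not state which convention the 58% figure follows.

```python
from typing import Sequence, Tuple


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label and prediction lists must be non-empty and equal in length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def accuracy_drop(source_acc: float, target_acc: float) -> Tuple[float, float]:
    """Return (absolute drop, relative drop) from source to target accuracy."""
    absolute = source_acc - target_acc
    relative = absolute / source_acc if source_acc > 0 else 0.0
    return absolute, relative


# Toy data only: three classes, ten images per test set (illustrative assumption).
source_true = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
source_pred = [0, 1, 2, 1, 0, 2, 1, 0, 2, 0]  # 9 of 10 correct
target_true = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
target_pred = [0, 2, 2, 0, 1, 2, 0, 0, 1, 2]  # 4 of 10 correct

source_acc = accuracy(source_true, source_pred)
target_acc = accuracy(target_true, target_pred)
absolute, relative = accuracy_drop(source_acc, target_acc)

print(f"source accuracy: {source_acc:.2f}")
print(f"target accuracy: {target_acc:.2f}")
print(f"absolute drop:   {absolute:.2f}")
print(f"relative drop:   {relative:.2%}")
```

The two conventions can differ widely for the same pair of accuracies, so a reported drop is easiest to interpret when both underlying accuracies are given alongside it.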
The dataset itself enables multiple research directions simultaneously. By including image captions, hallucination tags, and question-answer pairs, SAGE supports training across diverse tasks from classification to visual reasoning. This versatility accelerates development of region-specific models while enabling systematic study of how demographic factors influence AI behavior.
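As a rough illustration of how a single multi-task record combining these annotation types could be organized, the sketch below defines a hypothetical schema. The class names, field names, and example values are assumptions made for illustration; the actual SAGE release format is not described in this summary.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QAPair:
    """One question-answer pair attached to an image (hypothetical layout)."""
    question: str
    answer: str


@dataclass
class EndoscopyRecord:
    """Hypothetical multi-task record: caption, hallucination tags, and QA pairs."""
    image_path: str
    caption: str
    hallucination_tags: List[str] = field(default_factory=list)
    qa_pairs: List[QAPair] = field(default_factory=list)


def to_vqa_examples(record: EndoscopyRecord) -> List[Tuple[str, str, str]]:
    """Flatten one record into (image_path, question, answer) triples for VQA."""
    return [(record.image_path, qa.question, qa.answer) for qa in record.qa_pairs]


# Placeholder values only; none of this text comes from the dataset.
record = EndoscopyRecord(
    image_path="images/example_0001.jpg",
    caption="Placeholder caption describing the endoscopic view.",
    hallucination_tags=["placeholder-tag"],
    qa_pairs=[
        QAPair("Is an abnormality visible?", "placeholder answer"),
        QAPair("Which anatomical landmark is shown?", "placeholder answer"),
    ],
)

for image_path, question, answer in to_vqa_examples(record):
    print(image_path, "|", question, "|", answer)
```

Keeping all annotation types on one record lets the same image feed classification, captioning, and VQA pipelines without joining separate files.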
Looking ahead, this work establishes a template for geographic inclusivity in medical AI. The substantial performance drops should motivate development of either region-specific models or fundamentally different training approaches that do not depend on data dominated by a single region. Healthcare organizations in South Asia and researchers focused on medical AI equity now have a benchmark dataset for validating solutions to these documented gaps.
- Existing GI diagnostic AI models show 58% accuracy degradation on South Asian populations, indicating severe geographic bias in training data
- SAGE dataset enables benchmarking across classification, image captioning, and VQA tasks with 1,300 expert-annotated images and 14,726 QA pairs
- Large multimodal models achieve only 0.308 GREEN score for anatomical detection and 0.410 for abnormality detection on South Asian endoscopy images
- Geographic representation gaps in medical AI datasets create tools poorly suited for healthcare systems most constrained by specialist scarcity
- Multi-task dataset design with hallucination tags supports both model development and evaluation of AI reliability in clinical contexts