AIBullish · arXiv CS AI · Feb 27
StruXLIP: Enhancing Vision-language Models with Multimodal Structural Cues
StruXLIP is a fine-tuning paradigm for vision-language models that uses structural cues such as edge maps to improve cross-modal retrieval. The method augments standard CLIP training with three structure-centric losses, maximizing mutual information between multimodal structural representations to achieve more robust vision-language alignment.
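The summary does not spell out the loss formulation, but the general pattern it describes can be sketched as a standard CLIP-style InfoNCE objective augmented with an extra contrastive term that aligns edge-map embeddings with text. Everything below is a hypothetical illustration: the function names, the weighting `lam`, and the choice of a single structural term are assumptions, not StruXLIP's actual three losses.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable row-wise log-softmax.
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def info_nce(a, b, temperature=0.07):
    # Symmetric InfoNCE over L2-normalized embeddings (CLIP-style):
    # matched pairs sit on the diagonal of the similarity matrix.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits)[idx, idx].mean()
    loss_ba = -log_softmax(logits.T)[idx, idx].mean()
    return 0.5 * (loss_ab + loss_ba)

def structure_augmented_loss(img_emb, txt_emb, edge_emb, lam=0.5):
    # Hypothetical combination: the usual image-text contrastive loss
    # plus a structural term aligning edge-map embeddings with text.
    return info_nce(img_emb, txt_emb) + lam * info_nce(edge_emb, txt_emb)

rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
# Perfectly aligned triplets should score lower than mismatched ones.
aligned = structure_augmented_loss(emb, emb, emb)
mismatched = structure_augmented_loss(emb, emb[::-1], emb)
```

In a real training loop the three embedding sets would come from the image encoder, the text encoder, and a structural branch fed with edge maps, with gradients flowing through all of them.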