y0news
#cross-modal-retrieval (1 article)
AI · Bullish · arXiv – CS AI · Feb 27 · 6/106

StruXLIP: Enhancing Vision-language Models with Multimodal Structural Cues

StruXLIP is a new fine-tuning paradigm for vision-language models that uses edge maps and structural cues to improve cross-modal retrieval performance. The method augments standard CLIP training with three structure-centric losses to achieve more robust vision-language alignment by maximizing mutual information between multimodal structural representations.
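
The summary does not spell out the three structure-centric losses, so the following is only a rough illustration of the general idea, not StruXLIP's actual objective. It sketches a symmetric InfoNCE loss, the standard CLIP-style contrastive objective that is commonly used as a lower bound on mutual information, applied here to hypothetical paired embeddings (for example, an edge-map embedding and a text embedding). The names `info_nce` and `l2_normalize` and the temperature of 0.07 are assumptions made for this example.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(a_embs, b_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    a_embs[i] and b_embs[i] are treated as a matching pair; every other
    combination in the batch acts as a negative. This is a generic
    contrastive loss, assumed here for illustration only.
    """
    a = [l2_normalize(v) for v in a_embs]
    b = [l2_normalize(v) for v in b_embs]
    n = len(a)

    # Cosine similarities scaled by the temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is index i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the a-to-b direction (rows) and the b-to-a direction (columns).
    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(columns))


# Toy usage: correctly paired embeddings give a lower loss than swapped ones.
pairs = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = info_nce(pairs, pairs)
loss_swapped = info_nce(pairs, swapped)
```

In a real fine-tuning setup the embeddings would come from the model's image and text encoders, and a term like this would be added to the standard CLIP loss; how StruXLIP weights or combines its three losses is not stated in the summary.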