arXiv CS AI · 5h ago
ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
Researchers propose ITO, a framework for image-text representation learning that addresses the modality gap by combining multiple image-text alignment with training-time fusion. The method outperforms existing baselines on classification, retrieval, and multimodal benchmarks, and it stays efficient at inference because the fusion module is used only during training and then discarded.