From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model
Researchers propose a data-efficient framework that converts generative Multimodal Large Language Models (MLLMs) into universal embedding models without extensive pre-training. The method combines hierarchical embedding prompts with Self-aware Hard Negative Sampling and achieves competitive performance on embedding benchmarks with minimal training data.
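The summary names Self-aware Hard Negative Sampling but does not describe how it works. As a rough illustration of the general idea behind hard-negative mining for embedding training, and not the authors' algorithm, the sketch below picks the candidates most similar to a query while discarding any that score close to or above the known positive, since those are likely false negatives. The function names, the `margin` threshold, and the toy vectors are all hypothetical choices made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def select_hard_negatives(query, positive, candidates, k=2, margin=0.05):
    """Return indices of the k hardest negatives for `query`.

    Assumption (not from the source): a candidate counts as a likely
    false negative when its similarity to the query is within `margin`
    of, or above, the query-positive similarity, and is dropped.
    """
    pos_sim = cosine(query, positive)
    scored = [(cosine(query, c), i) for i, c in enumerate(candidates)]
    # Keep only candidates that are clearly less similar than the positive.
    kept = [(s, i) for s, i in scored if s < pos_sim - margin]
    # Hardest first: highest similarity, ties broken by original index.
    kept.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in kept[:k]]


# Toy 2-D embeddings, purely illustrative.
query = [1.0, 0.0]
positive = [0.9, 0.1]
candidates = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [0.8, 0.6]]
hard = select_hard_negatives(query, positive, candidates, k=2)
print(hard)
```

In this toy setup, candidate 0 is identical to the query and is filtered out as a probable false negative, while the remaining candidates are ranked by similarity. In practice the embeddings would come from the model being trained, and the selected negatives would feed a contrastive loss.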