Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)
The EReL@MIR 2025 Multimodal Document Retrieval Challenge invited teams to build retrieval systems handling both closed-set document page retrieval and open-domain Wikipedia passage retrieval from text and image queries. The competition attracted 22 teams with 586 submissions, with winning systems favoring decoder-based Multimodal-LLM embedders over traditional CLIP-style encoders.
The EReL@MIR 2025 challenge addresses a gap in information retrieval: most current systems discard visual information even though documents increasingly combine text, figures, tables, and charts. The competition required participants to build a single system that handles two distinct retrieval scenarios: closed-set retrieval of document pages and open-domain retrieval of Wikipedia passages from image or multimodal queries. A field of 455 entrants across 22 teams suggests strong interest in multimodal retrieval problems.
The convergence on Qwen2-VL embedders among top performers signals a shift in retrieval architecture preferences. Rather than relying on CLIP-style vision-language encoders, which embed images and text with separate encoders, winning teams used decoder-based Multimodal-LLM embedders that process visual and textual input in one model. This choice is in line with a broader trend toward unified transformer-based models for multimodal tasks.
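Whichever embedder produces the vectors, the retrieval step itself usually reduces to a nearest-neighbor search over embeddings. The following is a minimal sketch of that step, assuming each query and each document page has already been mapped to a single vector. The function names `cosine` and `retrieve` and the toy three-dimensional vectors are hypothetical and are not taken from any challenge submission.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_u * norm_v)


def retrieve(query_vec, page_vecs, k=3):
    """Return the k (page_id, score) pairs with the highest cosine similarity."""
    scored = [(page_id, cosine(query_vec, vec)) for page_id, vec in page_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for embedder outputs (assumption: a real system
# would obtain these from a multimodal embedder, with far more dimensions).
page_vecs = {
    "page_1": [0.9, 0.1, 0.0],
    "page_2": [0.1, 0.8, 0.2],
    "page_3": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]

top = retrieve(query_vec, page_vecs, k=2)
print(top)
```

In this toy setup the query points mostly along the first axis, so `page_1` ranks first and `page_2` second.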
The results reveal practical trade-offs in system design. The top three teams took markedly different approaches (fine-tuned ensembles, training-free fusion with re-ranking, and zero-shot late interaction) yet reached comparable performance. Most notably, the training-free system finished within 0.1 points of the fine-tuned winner, suggesting that careful architectural design can come close to supervised optimization without expensive training cycles. This matters for practitioners who want to deploy multimodal retrievers in resource-constrained environments.
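Two of the ideas named above, training-free fusion and late interaction, can be illustrated without any model training. The sketch below is an illustration under stated assumptions, not a reconstruction of any team's system: reciprocal rank fusion is used here as one common training-free way to merge ranked lists (the fusion method the teams actually used is not specified here), and `maxsim` shows the ColBERT-style late-interaction score, where each query token vector is matched to its best document token vector. All names, the constant `k=60`, and the toy data are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of ids; each list adds 1 / (k + rank) per id."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)


def maxsim(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the best dot product
    against any document token."""
    total = 0.0
    for q in query_tokens:
        total += max(sum(a * b for a, b in zip(q, d)) for d in doc_tokens)
    return total


# Toy rankings from two hypothetical retrievers (e.g. text-based and image-based).
text_ranking = ["p2", "p1", "p3"]
image_ranking = ["p1", "p3", "p2"]
fused = reciprocal_rank_fusion([text_ranking, image_ranking])
print(fused)

# Toy token embeddings: doc_a covers both query tokens, doc_b only the first.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 0.0], [1.0, 0.0]]
print(maxsim(query_tokens, doc_a), maxsim(query_tokens, doc_b))
```

Neither function has trainable parameters, which is what makes this style of pipeline attractive when fine-tuning is too costly: the only learned components are the off-the-shelf embedders that produce the rankings and token vectors.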
The challenge provides a benchmark for multimodal document retrieval that can guide future work on retrieval-augmented generation systems, particularly as language models increasingly integrate vision capabilities.
- Winning systems converged on decoder-based Multimodal-LLM embedders rather than CLIP-style encoders for multimodal document retrieval
- A training-free approach finished within 0.1 points of the fine-tuned ensemble that won, indicating that efficient architectures can come close to supervised methods
- The challenge addressed the gap in handling visual information in document retrieval systems that support RAG applications
- Participation by 455 entrants across 22 teams suggests strong interest in unified multimodal retrieval
- Zero-shot and training-free methods proved competitive with fine-tuned approaches, reducing deployment friction for multimodal systems