Researchers developed MSUE, a multi-expert question-answering system that placed third in the 2026 SoccerNet VQA Challenge with 0.95 accuracy by combining vision-language models, large language models, and specialized experts. The solution uses an LLM router to dynamically dispatch each question to a text, image, or video processing expert, an example of multi-modal AI applied to a domain-specific task.
This research shows progress in multi-modal AI systems that can understand complex sports video content. The MSUE architecture takes a practical approach to domain-specific question answering: it combines multiple specialized models rather than relying on a single foundation model. Its dynamic routing mechanism, with an LLM acting as coordinator, shows how experts for different modalities can be orchestrated within one system.
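The routing idea can be illustrated with a minimal sketch. Everything here is hypothetical: the `Question` type, the expert functions, and `stub_router` are illustrative names, and the router is a simple rule that stands in for the LLM call the researchers describe. It is not the MSUE implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Question:
    """A question with optional attached media (hypothetical structure)."""
    text: str
    image_path: Optional[str] = None
    video_path: Optional[str] = None


def stub_router(q: Question) -> str:
    """Stand-in for the LLM router: picks an expert label from the attached media.

    A real router would prompt an LLM with the question and parse its answer.
    """
    if q.video_path is not None:
        return "video"
    if q.image_path is not None:
        return "image"
    return "text"


# Placeholder experts; each would wrap a real model in practice.
def text_expert(q: Question) -> str:
    return f"[text expert] {q.text}"


def image_expert(q: Question) -> str:
    return f"[image expert] {q.text}"


def video_expert(q: Question) -> str:
    return f"[video expert] {q.text}"


EXPERTS: Dict[str, Callable[[Question], str]] = {
    "text": text_expert,
    "image": image_expert,
    "video": video_expert,
}


def answer(q: Question, router: Callable[[Question], str] = stub_router) -> str:
    """Route the question to one expert and return that expert's answer."""
    label = router(q)
    # Fall back to the text expert if the router returns an unknown label.
    expert = EXPERTS.get(label, text_expert)
    return expert(q)
```

Keeping the router as a swappable callable means the rule-based stub can later be replaced by an LLM-backed function without touching the experts.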
The data synthesis pipeline driven by Vision-Language Models addresses a fundamental challenge in machine learning: generating diverse, high-quality training data at scale. By systematically restructuring raw domain data into varied VQA samples with both concise and long-form responses, the researchers created a more robust training foundation. This approach has implications beyond soccer analysis, suggesting a scalable methodology for other specialized domains lacking abundant labeled data.
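As a rough illustration of restructuring one raw record into several VQA samples, the sketch below uses fixed templates in place of a Vision-Language Model. The record fields (`player`, `team`, `position`), the templates, and the function name are assumptions made for this example, not details from the research.

```python
from typing import Dict, List

# Hypothetical question templates, keyed by the record field they ask about.
QUESTION_TEMPLATES: Dict[str, str] = {
    "team": "Which team does {player} play for?",
    "position": "What position does {player} play?",
}

# Hypothetical long-form answer templates for the same fields.
LONG_TEMPLATES: Dict[str, str] = {
    "team": "{player} plays for {value}.",
    "position": "{player} plays as a {value}.",
}


def synthesize_samples(record: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn one raw record into VQA samples with concise and long-form answers."""
    player = record["player"]
    samples: List[Dict[str, str]] = []
    for field, question_template in QUESTION_TEMPLATES.items():
        if field not in record:
            continue  # skip fields this record does not provide
        value = record[field]
        question = question_template.format(player=player)
        # Concise variant: the bare value.
        samples.append({"question": question, "answer": value, "style": "concise"})
        # Long-form variant: a full sentence built from the same value.
        long_answer = LONG_TEMPLATES[field].format(player=player, value=value)
        samples.append({"question": question, "answer": long_answer, "style": "long"})
    return samples


if __name__ == "__main__":
    # Placeholder record, not real data.
    for sample in synthesize_samples(
        {"player": "Player A", "team": "Team X", "position": "goalkeeper"}
    ):
        print(sample)
```

In a VLM-driven pipeline the templates would be replaced by model-generated questions and answers, which is where the diversity described above would come from.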
The third-place finish with 0.95 accuracy suggests that current multi-modal techniques are mature, while the remaining errors show there is still room for improvement. The system combines Gemini3-Flash for text understanding, a fine-tuned Qwen3-VL for vision, and external knowledge bases, reflecting how modern AI systems increasingly rely on ensemble approaches rather than a single unified model.
For the broader AI industry, this work supports the case for modular architectures and expert specialization. As organizations develop AI systems for specific domains, the MSUE methodology offers a practical template. Its emphasis on cost-effectiveness reflects growing attention to efficient AI deployment, which matters more as computational demands scale.
- MSUE achieved 0.95 accuracy in the SoccerNet VQA Challenge, ranking third with a dynamic expert-routing architecture.
- A data synthesis pipeline driven by Vision-Language Models efficiently creates diverse training samples from raw domain data.
- A multi-expert ensemble combining text, image, and video specialists outperforms single unified-model approaches.
- An LLM routing mechanism coordinates the specialized experts for domain-specific question answering.
- The cost-effective methodology suggests a scalable approach to deploying AI systems in specialized domains.