Researchers introduce ARMS, a router system designed to intelligently select among multiple vision-language models based on input queries. The 800M-parameter system matches or exceeds GPT-4o's selection accuracy while offering efficiency benefits, addressing the practical challenge of VLM selection across diverse applications.
ARMS addresses a concrete deployment problem that becomes more pressing as VLM options proliferate. Users and developers currently struggle to choose among vision-language models with different performance characteristics, costs, and resource requirements. ARMS automates this selection by learning how different models perform on specific query types.
This work builds on emerging research into the performance paradox, in which larger or more sophisticated models do not uniformly outperform alternatives across all scenarios. Rather than forcing users toward a single monolithic solution, ARMS enables dynamic routing to the most appropriate model per query. The research addresses three critical barriers: creating specialized training data (32,626 image-text queries), developing effective multimodal representations, and enabling flexible adaptation to new models through incremental and independent training strategies.
The practical implications are significant for production environments. Organizations deploying multiple VLMs can optimize cost-quality tradeoffs by routing simple queries to efficient models while reserving expensive inference for complex cases. At 800M parameters, ARMS is small enough to deploy as a lightweight decision layer with little added latency. Its demonstrated ability to adapt to new VLMs through incremental or independent training suggests long-term viability despite the constantly evolving model landscape.
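The cost-quality tradeoff described above can be illustrated with a minimal sketch. It is not ARMS's actual method: the source does not describe how cost enters the routing decision, so the `Candidate` type, the `route` function, the `tolerance` parameter, the model names, and the cost figures are all hypothetical. The per-model scores stand in for whatever a learned router would predict for a given query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str    # hypothetical model label
    cost: float  # hypothetical relative cost per query


def route(scores: dict[str, float], candidates: list[Candidate],
          tolerance: float = 0.05) -> str:
    """Pick the cheapest candidate whose predicted score is within
    `tolerance` of the best predicted score for this query.

    `scores` maps model name -> predicted quality for one query
    (assumed to come from a router; values here are made up).
    """
    best = max(scores[c.name] for c in candidates)
    eligible = [c for c in candidates if scores[c.name] >= best - tolerance]
    return min(eligible, key=lambda c: c.cost).name


candidates = [Candidate("small-vlm", 1.0), Candidate("large-vlm", 10.0)]

# Simple query: both models are predicted to do well, so the cheap one wins.
easy = route({"small-vlm": 0.91, "large-vlm": 0.93}, candidates)

# Hard query: only the expensive model is predicted to do well.
hard = route({"small-vlm": 0.40, "large-vlm": 0.90}, candidates)
```

Setting `tolerance` to zero reduces this to always choosing the model with the highest predicted score; raising it trades predicted quality for lower cost.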
Looking forward, this approach may inspire similar routers for other model categories (language models, image generators, embeddings). The reproducible research artifacts enable community validation and commercial implementations, and could influence how companies architect multi-model inference pipelines.
- ARMS enables intelligent routing between vision-language models, solving the practical selection problem faced by deployment teams
- The 800M-parameter router matches or exceeds GPT-4o's selection accuracy at a fraction of the computational cost and scale
- Two training strategies (incremental and independent) allow ARMS to adapt to new VLMs without full retraining
- A new multimodal dataset of 32,626 queries across seven VLMs provides a foundation for VLM selection research
- Dynamic model routing optimizes cost-quality tradeoffs by matching query complexity to appropriate model capabilities