Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
Researchers propose Reroute, a training-free method that replaces permanent visual token removal in vision-language models with recoverable token routing. Instead of discarding less important visual tokens, the approach defers them so they can re-enter at later decoder layers, improving performance on grounding tasks while keeping the computational budget of existing token-reduction methods.
Vision-language models face a fundamental efficiency challenge: processing hundreds or thousands of visual tokens creates substantial computational overhead and memory consumption during inference. Current approaches use a rank-and-remove strategy, permanently discarding tokens deemed less important based on early-layer analysis. However, this irreversible pruning strategy overlooks a critical insight: token importance is not static across decoder depths. Tokens ranked as low-priority early in processing may become crucial for grounding-sensitive queries in later layers, making permanent removal suboptimal.
Reroute addresses this limitation through recoverable routing. Rather than deleting tokens, the method defers them to a candidate pool, where they remain accessible at subsequent routing decision points. This preserves the theoretical computational budget and memory constraints of existing pruning methods while recovering performance on grounding tasks. The method operates as a training-free plug-in, making it compatible with existing token-reduction techniques such as FastV, PDrop, and Nüwa, and with different model architectures, including LLaVA-1.5 and Qwen.
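The difference between rank-and-remove and recoverable routing can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the per-stage score dictionaries, and the per-stage budgets are all hypothetical, and the sketch assumes an importance score is available for every token, including deferred ones, at each routing decision point.

```python
from typing import Dict, List, Sequence


def top_k(candidates: Sequence[int], scores: Dict[int, float], k: int) -> List[int]:
    """Return the k highest-scoring token ids, restored to positional order."""
    ranked = sorted(candidates, key=lambda t: scores[t], reverse=True)[:k]
    return sorted(ranked)


def rank_and_remove(tokens: Sequence[int],
                    scores_per_stage: Sequence[Dict[int, float]],
                    budgets: Sequence[int]) -> List[List[int]]:
    """Baseline: each stage ranks only the survivors; dropped tokens are gone."""
    active = list(tokens)
    history = []
    for scores, k in zip(scores_per_stage, budgets):
        active = top_k(active, scores, k)
        history.append(active)
    return history


def recoverable_route(tokens: Sequence[int],
                      scores_per_stage: Sequence[Dict[int, float]],
                      budgets: Sequence[int]) -> List[List[int]]:
    """Sketch of recoverable routing: unselected tokens wait in a pool
    and compete again at the next routing decision point."""
    active, pool = list(tokens), []
    history = []
    for scores, k in zip(scores_per_stage, budgets):
        candidates = active + pool          # deferred tokens are still eligible
        active = top_k(candidates, scores, k)
        pool = [t for t in candidates if t not in active]
        history.append(active)
    return history


# Hypothetical scores: token 2 looks unimportant early but matters later.
tokens = [0, 1, 2, 3]
stages = [
    {0: 0.9, 1: 0.8, 2: 0.1, 3: 0.2},   # early routing point
    {0: 0.9, 1: 0.1, 2: 0.8, 3: 0.2},   # later routing point
]
budgets = [2, 2]                         # same number of active tokens per stage

print(rank_and_remove(tokens, stages, budgets))    # [[0, 1], [0, 1]]
print(recoverable_route(tokens, stages, budgets))  # [[0, 1], [0, 2]]
```

In both cases only two tokens are active at each stage, so the number of tokens processed per layer is the same. The only difference is that the routed version can bring token 2 back once its score rises, whereas the baseline has already discarded it.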
The research reports improved grounding accuracy under aggressive token reduction while maintaining general visual question-answering performance. This is progress toward more efficient multimodal AI systems that do not sacrifice capability on specialized tasks. The broader implication is that VLM optimization should treat token reduction as a dynamic routing problem rather than static pruning, which may open new efficiency strategies for production deployments where both general capability and task-specific performance matter.
- Reroute replaces permanent token removal with recoverable routing, allowing deferred tokens to re-enter consideration at later decoder layers.
- The method improves grounding performance under aggressive token reduction while maintaining visual question-answering accuracy.
- Token importance varies significantly across decoder depth, making static pruning strategies suboptimal for multimodal models.
- Reroute operates as a training-free plug-in compatible with existing token-reduction methods without changing their computational budgets.
- The approach suggests vision-language model optimization should treat token reduction as dynamic routing rather than irreversible pruning.