AVIS: Adaptive Test-Time Scaling for Vision-Language Models
Researchers introduce AVIS, a lightweight adaptive policy that optimizes inference efficiency in Vision-Language Models by jointly scaling visual context and reasoning computation. The method uses token pruning and difficulty prediction to reduce computational costs while maintaining or improving accuracy across image and video reasoning tasks.
Vision-Language Models (VLMs) have become markedly more capable through chain-of-thought prompting and test-time scaling, but these gains typically demand substantial compute at inference time. AVIS treats inference optimization as a two-dimensional problem: how many visual tokens the model processes (Visual Context Scaling) and how much reasoning it performs (Visual Reasoning Scaling). Rather than tuning these dimensions independently, the research coordinates them per query, so that compute goes where it provides the greatest benefit.

Two mechanisms implement this. Key Diversity Visual pruning removes redundant visual tokens before processing; it is training-free and has linear computational complexity. Adaptive self-consistency uses learned difficulty prediction to decide how many reasoning rollouts each query receives. The design is compatible with shared-prefill inference, so multiple reasoning chains can reuse the same cached computation. The approach remains effective when applied to reinforcement-learning post-trained models, which suggests it applies broadly across VLMs.

The research reports consistent improvements in the accuracy-compute tradeoff across diverse benchmarks, addressing a major pain point for real-world VLM deployment. As model sizes grow and visual inputs become more complex, efficient inference becomes essential for practical applications. AVIS represents incremental but meaningful progress toward making advanced VLM capabilities accessible without prohibitive computational overhead, which matters most for edge deployment and cost-sensitive applications.
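To make the idea of training-free, diversity-based token pruning concrete, here is a minimal Python sketch. It is not the AVIS algorithm: the article does not describe how Key Diversity Visual pruning scores tokens, so this sketch assumes one simple rule, namely that a token is more "diverse" the farther its attention key lies from the mean key. The function name `prune_by_key_diversity` and the `keep` parameter are hypothetical.

```python
import heapq
from typing import List, Sequence


def prune_by_key_diversity(keys: Sequence[Sequence[float]], keep: int) -> List[int]:
    """Return the indices (in original order) of the `keep` tokens to retain.

    ASSUMPTION: diversity is scored as squared distance from the mean key.
    This is an illustrative stand-in, not the scoring rule used by AVIS.
    """
    n = len(keys)
    if keep <= 0:
        return []
    if keep >= n:
        return list(range(n))

    dim = len(keys[0])
    # One pass over the tokens to compute the mean key.
    mean = [sum(k[j] for k in keys) / n for j in range(dim)]
    # A second pass to score each token by its squared distance from the mean.
    scores = [sum((k[j] - mean[j]) ** 2 for j in range(dim)) for k in keys]
    # Select the `keep` highest-scoring tokens, then restore positional order.
    top = heapq.nlargest(keep, range(n), key=scores.__getitem__)
    return sorted(top)
```

Both scoring passes are linear in the number of tokens and need no learned parameters. The selection step with `heapq.nlargest` costs O(N log k), so this sketch is only near-linear overall. In the toy input `[[1, 0], [1, 0], [1, 0], [0, 1], [-1, 0]]` with `keep=2`, the three near-duplicate keys are dropped and the two outlying tokens survive.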
- AVIS jointly optimizes visual token pruning and reasoning scaling to reduce VLM inference costs while maintaining accuracy
- Key Diversity Visual pruning removes redundant tokens in O(N) time without requiring model retraining
- Learned difficulty prediction enables adaptive selection of reasoning rollout counts per query
- The method integrates with shared-prefill inference architecture for improved computational efficiency
- Effectiveness demonstrated across image and video reasoning tasks including RL post-trained models
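The difficulty-driven rollout selection and the shared-prefill reuse listed above can be sketched together in a few lines of Python. This is an illustration under stated assumptions, not the AVIS implementation: the linear mapping from difficulty to rollout count, the bounds of 1 and 8 rollouts, and the `prefill` and `decode` callables are all hypothetical, and the learned difficulty predictor is represented only by the score passed in.

```python
from collections import Counter
from typing import Callable, Tuple, TypeVar

Cache = TypeVar("Cache")


def rollout_budget(difficulty: float, min_rollouts: int = 1, max_rollouts: int = 8) -> int:
    """Map a difficulty score in [0, 1] to a rollout count.

    ASSUMPTION: a clamped linear mapping; the article does not say how
    AVIS turns predicted difficulty into a budget.
    """
    d = min(max(difficulty, 0.0), 1.0)
    return min_rollouts + round(d * (max_rollouts - min_rollouts))


def adaptive_self_consistency(
    prefill: Callable[[], Cache],
    decode: Callable[[Cache, int], str],
    difficulty: float,
) -> Tuple[str, int]:
    """Sample a difficulty-dependent number of answers and majority-vote them."""
    n = rollout_budget(difficulty)
    # Shared prefill: the prompt and visual context are encoded once ...
    cache = prefill()
    # ... and every reasoning rollout reuses that cached result.
    answers = [decode(cache, seed) for seed in range(n)]
    winner, _ = Counter(answers).most_common(1)[0]
    return winner, n
```

An easy query (difficulty near 0) receives a single rollout, while a hard one (difficulty near 1) receives the full budget. In both cases `prefill` runs exactly once, which is where the saving from shared-prefill inference comes from.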