
AVIS: Adaptive Test-Time Scaling for Vision-Language Models

arXiv – CS AI | Ahmadreza Jeddi, Minh Ngoc Le, Amirhossein Kazerouni, Hakki Can Karaimer, Hue Nguyen, Iqbal Mohomed, Michael Brudno, Alex Levinshtein, Konstantinos G. Derpanis, Babak Taati, Radek Grzeszczuk | June 11, 2026 at 04:00 AM

AI Summary

Researchers introduce AVIS, a lightweight adaptive policy that optimizes inference efficiency in Vision-Language Models by jointly scaling visual context and reasoning computation. The method uses token pruning and difficulty prediction to reduce computational costs while maintaining or improving accuracy across image and video reasoning tasks.

Analysis

Vision-Language Models (VLMs) have gained substantial capability from chain-of-thought prompting and test-time scaling, but both techniques raise inference cost. AVIS treats inference efficiency as a two-dimensional problem: how many visual tokens the model processes (Visual Context Scaling) and how much reasoning it performs (Visual Reasoning Scaling). Rather than tuning each dimension independently, AVIS adapts both per query, so that compute goes where it provides the greatest benefit.

Key Diversity Visual pruning removes redundant visual tokens before processing; it is training-free and has linear computational complexity. Adaptive self-consistency uses a learned difficulty predictor to decide how many reasoning rollouts each query receives. The design is compatible with shared-prefill inference, in which multiple reasoning chains reuse cached computations. The approach also remains effective on models post-trained with reinforcement learning, which suggests it applies broadly across VLMs.

The research reports consistent improvements in accuracy-compute tradeoffs across diverse benchmarks. Efficient inference matters more as model sizes and visual input complexity grow. AVIS is incremental but meaningful progress toward making advanced VLM capabilities available without prohibitive compute, which is particularly relevant for edge deployment and cost-sensitive applications.
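To make the pruning idea concrete, here is a minimal Python sketch of one way a training-free, linear-time diversity filter over visual tokens could work. It is not the Key Diversity Visual algorithm itself, whose details this summary does not give. The function name `prune_by_key_diversity`, the `keep_ratio` parameter, and the centroid-similarity score are all assumptions chosen for illustration.

```python
import numpy as np


def prune_by_key_diversity(keys: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Return sorted indices of the visual tokens to keep.

    keys: (N, d) array of per-token key vectors (hypothetical input).
    keep_ratio: fraction of tokens to retain, in (0, 1] (hypothetical knob).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n = keys.shape[0]
    k = max(1, int(round(n * keep_ratio)))

    # Normalize so the score depends on direction only, not magnitude.
    norms = np.linalg.norm(keys, axis=1, keepdims=True)
    unit = keys / np.maximum(norms, 1e-12)

    # Assumed redundancy score: cosine similarity to the mean direction.
    # Tokens close to the "average" token are treated as redundant.
    centroid = unit.mean(axis=0)
    centroid = centroid / max(np.linalg.norm(centroid), 1e-12)
    redundancy = unit @ centroid

    # argpartition finds the k least redundant tokens without a full sort,
    # so the whole routine is linear in N for a fixed dimension d.
    keep = np.argpartition(redundancy, k - 1)[:k]
    return np.sort(keep)


# Six near-duplicate tokens plus two that point elsewhere.
keys = np.array([[1.0, 0.0]] * 6 + [[0.0, 1.0], [-1.0, 0.0]])
kept = prune_by_key_diversity(keys, keep_ratio=0.25)
```

With this toy input, the two tokens that differ from the six duplicates are the ones retained. A single centroid is a crude proxy for diversity; it is used here only because it keeps the cost linear and needs no training.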

Key Takeaways

→ AVIS jointly optimizes visual token pruning and reasoning scaling to reduce VLM inference costs while maintaining accuracy
→ Key Diversity Visual pruning removes redundant tokens in O(N) time without requiring model retraining
→ Learned difficulty prediction enables adaptive selection of reasoning rollout counts per query
→ The method is compatible with shared-prefill inference, which lets reasoning chains reuse cached computation
→ Effectiveness is demonstrated across image and video reasoning tasks, including on RL post-trained models
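The adaptive rollout idea in the takeaways above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the AVIS policy: the difficulty score is taken as a given number in [0, 1] rather than produced by a learned predictor, and the names `rollouts_for` and `adaptive_self_consistency` and the budget tiers `(1, 4, 8)` are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def rollouts_for(difficulty: float, budgets: Sequence[int] = (1, 4, 8)) -> int:
    """Map a difficulty score in [0, 1] to a rollout count.

    The equal-width tiers and the default budgets are illustrative only.
    """
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must be in [0, 1]")
    tier = min(int(difficulty * len(budgets)), len(budgets) - 1)
    return budgets[tier]


def adaptive_self_consistency(sample_answer: Callable[[], str],
                              difficulty: float) -> str:
    """Sample a difficulty-dependent number of answers and majority-vote."""
    n = rollouts_for(difficulty)
    answers = [sample_answer() for _ in range(n)]
    # most_common(1) returns the answer with the highest vote count.
    return Counter(answers).most_common(1)[0][0]


# Stand-in for a model: replays canned answers instead of generating them.
canned = iter(["B", "A", "A", "B", "A", "A", "C", "A"])
final = adaptive_self_consistency(lambda: next(canned), difficulty=0.9)
```

Easy queries get a single rollout and skip voting entirely, while hard ones get the full budget. In a shared-prefill setting, each extra rollout would reuse the cached prompt computation, so the marginal cost is mostly decoding.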