🧠 AI · 🟢 Bullish · Importance 6/10

Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models

arXiv – CS AI | Jie Ma, Yihang Liu, Zhike Qiu, Jiayi Ji, Xiaoshuai Sun
🤖 AI Summary

Researchers introduce COAST, a novel pruning framework for vision-language models that reduces visual tokens by 77.8% while maintaining 98.64% performance and achieving 2.15x speedup. Unlike existing methods that discard low-attention tokens, COAST uses adaptive semantic routing to preserve contextually essential information, preventing 'Visual Aphasia'—a failure mode where models lose visual grounding.

Analysis

Current vision-language model optimization relies on a flawed assumption: that tokens receiving low attention scores are redundant and can be safely removed. COAST challenges this paradigm by demonstrating that shallow attention metrics fail to capture tokens' importance for compositional reasoning tasks. The framework identifies a critical problem called Visual Aphasia, where aggressive pruning causes models to abandon visual grounding and rely solely on language priors, degrading performance on complex visual reasoning.
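To make the critiqued baseline concrete, below is a minimal PyTorch sketch of single-pass scalar attention pruning: visual tokens are ranked by the attention mass they receive from the text tokens, and the lowest-scoring ones are dropped. The function name, tensor shapes, and keep ratio are illustrative assumptions, not taken from the paper.

    import torch

    def naive_attention_prune(visual_tokens, attn_weights, keep_ratio=0.222):
        """Baseline that COAST critiques: rank visual tokens by the scalar
        attention mass they receive and keep only the top fraction.

        visual_tokens: (N, D) visual token embeddings
        attn_weights:  (T, N) attention from T text tokens to N visual tokens
        keep_ratio:    fraction retained (0.222 mirrors the 77.8% reduction)
        """
        # Collapse attention to one scalar per visual token. This is the
        # "shallow" metric: it ignores how importance is distributed across
        # text tokens and how it evolves across layers.
        scores = attn_weights.mean(dim=0)                 # (N,)
        k = max(1, int(keep_ratio * visual_tokens.shape[0]))
        keep_idx = scores.topk(k).indices.sort().values   # keep spatial order
        return visual_tokens[keep_idx], keep_idx

    # Example: 576 visual tokens (a 24x24 grid), 32 text tokens, dim 1024
    vis = torch.randn(576, 1024)
    attn = torch.softmax(torch.randn(32, 576), dim=-1)
    pruned, idx = naive_attention_prune(vis, attn)
    print(pruned.shape)  # torch.Size([127, 1024])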

The research builds on growing concerns about the efficiency of large vision-language models (LVLMs). As these models scale, inference latency becomes prohibitive for real-world deployment. Previous pruning approaches used single-pass attention scoring, ignoring how token importance evolves across layers. COAST addresses this by using cross-modal attention entropy to estimate contextual dispersion and by computing contrastive routing scores that preserve both semantic evidence and spatial context.
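The summary names two ingredients, cross-modal attention entropy as a dispersion estimate and a contrastive routing score that weighs semantic evidence against redundancy, but not their exact formulation. The sketch below is one plausible reading under stated assumptions: the entropy-modulated budget rule, the alpha weight, and the mean-context redundancy term are illustrative, not COAST's actual equations.

    import torch
    import torch.nn.functional as F

    def coast_style_prune(visual_tokens, attn_weights, base_keep=0.222, alpha=0.5):
        """Illustrative sketch (not the paper's code) of entropy-adaptive,
        contrast-aware token selection.

        visual_tokens: (N, D) visual token embeddings
        attn_weights:  (T, N) cross-modal attention (text -> visual)
        """
        # 1) Cross-modal attention entropy: how dispersed is the text's
        #    attention over visual tokens? High entropy means importance is
        #    spread out, so a fixed top-k on scalar scores is unreliable.
        p = attn_weights.mean(dim=0)
        p = p / p.sum()
        entropy = -(p * (p + 1e-12).log()).sum()
        max_entropy = torch.log(torch.tensor(float(p.numel())))
        dispersion = (entropy / max_entropy).item()       # in [0, 1]

        # 2) Adapt the budget: keep more tokens when attention is dispersed
        #    (assumed modulation rule; the paper may use a different one).
        keep_ratio = min(1.0, base_keep * (1.0 + dispersion))

        # 3) Contrastive routing score: reward attention mass but penalize
        #    redundancy with the mean visual context, so spatially
        #    distinctive tokens survive even with modest attention.
        context = visual_tokens.mean(dim=0, keepdim=True)                 # (1, D)
        redundancy = F.cosine_similarity(visual_tokens, context, dim=-1)  # (N,)
        score = p - alpha * redundancy * p.mean()

        k = max(1, int(keep_ratio * visual_tokens.shape[0]))
        keep_idx = score.topk(k).indices.sort().values
        return visual_tokens[keep_idx], keep_ratio

The departure from the scalar baseline above is that both the keep budget and the per-token score react to context: a dispersed attention pattern widens the budget, and tokens that merely duplicate the global visual context are down-weighted even when they attract attention.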

The practical implications are substantial. Achieving 2.15x latency speedup with minimal performance loss (98.64% retention) makes vision-language models viable for resource-constrained environments—critical for edge deployment and real-time applications. The framework's training-free nature and generalization across multiple LVLM families suggest robust applicability rather than model-specific optimization.

Looking ahead, this work represents a shift toward nuanced compression strategies that respect model architecture rather than applying blanket pruning rules. Success across seven benchmarks and multiple token budgets indicates COAST could become a standard optimization layer for production LVLM systems. Future research may explore whether similar adaptive routing principles apply to other modalities or attention mechanisms.

Key Takeaways
  • COAST achieves 77.8% visual token reduction while maintaining 98.64% performance, demonstrating that aggressive pruning can be done intelligently without sacrificing reasoning quality.
  • The framework identifies Visual Aphasia as a failure mode where models lose visual grounding when low-attention tokens are prematurely discarded, validating the need for contextual preservation.
  • Adaptive semantic routing using attention entropy and contrastive scoring outperforms shallow scalar-based pruning across diverse benchmarks and LVLM architectures.
  • Training-free optimization enables immediate adoption without requiring model retraining, reducing deployment barriers for practitioners.
  • 2.15x latency speedup combined with minimal performance loss makes vision-language models practical for real-time and edge deployment scenarios.
Read Original → via arXiv – CS AI