AdaMerge: Salience-Aware Adaptive Token Merging for Training-Free Acceleration of Vision Transformers
AdaMerge is a training-free method that accelerates Vision Transformers by making token merging salience-aware and by adapting compression layer by layer. It is reported to outperform existing token-reduction methods at every compute budget evaluated, with a better accuracy-to-FLOPs trade-off on ImageNet-1k.
AdaMerge addresses a critical computational bottleneck in Vision Transformers by advancing token merging methodology. The cost of self-attention in ViTs grows quadratically with token count, which constrains practical deployment. Previous token merging approaches such as ToMe showed promise as training-free solutions, but they treated all tokens as contributing equally to the model's output, so information was lost when compression became aggressive.
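To make the quadratic scaling concrete, the rough sketch below counts multiply-accumulates for a single self-attention block using the standard approximation: four d×d projections plus two N×N×d products. The numbers are illustrative assumptions (196 patch tokens for a 224×224 image with 16×16 patches, class token ignored, width 768 as in ViT-B), not figures from the AdaMerge evaluation, and `attention_macs` is a hypothetical helper name.

```python
def attention_macs(num_tokens: int, dim: int) -> dict:
    """Approximate multiply-accumulate counts for one self-attention block."""
    # Q, K, V and output projections: linear in the number of tokens.
    projections = 4 * num_tokens * dim * dim
    # QK^T score matrix and the attention-weighted sum of V: quadratic in tokens.
    pairwise = 2 * num_tokens * num_tokens * dim
    return {"projections": projections, "pairwise": pairwise}


full = attention_macs(196, 768)  # assumed: all patch tokens kept
half = attention_macs(98, 768)   # assumed: half of the tokens merged away

print("projection ratio:", full["projections"] / half["projections"])
print("pairwise ratio:", full["pairwise"] / half["pairwise"])
```

Halving the token count halves the projection term but cuts the pairwise term to a quarter, which is why reducing tokens early in the network pays off in every later layer.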
The research builds on the established observation that attention in transformer architectures is non-uniform: token salience varies significantly across a sequence, yet prior merging frameworks did not exploit this. AdaMerge introduces two innovations to close the gap. Salience-weighted similarity uses column-wise feature-affinity centrality to identify high-importance tokens and protect them during merging. Adaptive merging intensity adjusts the compression ratio of each layer dynamically, based on how much redundancy the specific input exhibits.
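The two ideas above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the summary does not give AdaMerge's formulas, so the centrality definition (column sums of a row-softmaxed cosine-affinity matrix), the weighting factor `alpha`, and the threshold `min_score` are all hypothetical. The pairing step follows the ToMe-style bipartite split into alternating token sets, simplified to a plain mean when tokens are merged.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (epsilon avoids 0/0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / (norm + 1e-12)


def column_centrality(tokens):
    """Assumed salience proxy: softmax each row of the pairwise affinity
    matrix, then sum each column."""
    n = len(tokens)
    centrality = [0.0] * n
    for i in range(n):
        row = [math.exp(cosine(tokens[i], tokens[j])) for j in range(n)]
        total = sum(row)
        for j in range(n):
            centrality[j] += row[j] / total
    return centrality


def merge_tokens(tokens, salience, max_merges, alpha=0.5, min_score=0.4):
    """Merge up to max_merges token pairs, skipping salient or dissimilar pairs."""
    if len(tokens) < 2:
        return [list(t) for t in tokens]
    a_idx = list(range(0, len(tokens), 2))  # set A: even positions
    b_idx = list(range(1, len(tokens), 2))  # set B: odd positions
    max_sal = max(salience) + 1e-12
    proposals = []
    for i in a_idx:
        # Each A token proposes a merge with its most similar B token.
        j = max(b_idx, key=lambda k: cosine(tokens[i], tokens[k]))
        # Hypothetical weighting: similarity is discounted by the salience
        # of the more important token in the pair.
        pair_sal = max(salience[i], salience[j]) / max_sal
        score = cosine(tokens[i], tokens[j]) * (1.0 - alpha * pair_sal)
        proposals.append((score, i, j))
    proposals.sort(reverse=True)
    # Adaptive intensity (assumed rule): only pairs above the threshold are
    # merged, so inputs with little redundancy are compressed less.
    accepted = [p for p in proposals if p[0] >= min_score][:max_merges]

    groups = {j: [tokens[j]] for j in b_idx}
    absorbed = set()
    for _, i, j in accepted:
        groups[j].append(tokens[i])
        absorbed.add(i)
    merged = []
    for idx in range(len(tokens)):
        if idx in absorbed:
            continue
        members = groups.get(idx, [tokens[idx]])
        merged.append([sum(col) / len(members) for col in zip(*members)])
    return merged


tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
merged = merge_tokens(tokens, column_centrality(tokens), max_merges=2)
print(len(tokens), "->", len(merged))
```

Because the number of accepted pairs depends on the scores rather than on a fixed schedule, a layer whose tokens are mostly distinct merges few or none of them, while a highly redundant layer merges up to `max_merges`. A production version would operate on batched tensors and track merged-token sizes for weighted averaging, as ToMe does.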
Benchmark results show consistent improvements over competing approaches. On ViT-B/16 at 13.4G FLOPs, AdaMerge loses only 1.06% accuracy, compared with 1.45% for PiToMe and 4.62% for DSM. The gap widens at higher compression levels, which suggests the method is most valuable under tight resource constraints. Because it remains training-free, it keeps the practical advantages of earlier merging methods while delivering measurable quality improvements.
For practitioners deploying vision models in computationally constrained environments, AdaMerge represents tangible progress toward efficient transformer inference. The methodology may extend beyond image classification to video processing and other vision-intensive tasks where large token counts make inference costly. Future work will likely explore integration with other acceleration techniques and extension to other transformer architectures.
- AdaMerge combines salience-weighted token similarity with adaptive per-layer compression to improve Vision Transformer efficiency without retraining
- The framework outperforms existing token-merging methods across all computational efficiency levels, with accuracy advantages widening at higher compression ratios
- Training-free design enables immediate deployment in existing systems without modification to model architectures or training pipelines
- Salience-aware mechanisms preserve high-importance tokens while aggressively merging redundant ones, reducing information loss during compression
- Results demonstrate 1.06% accuracy degradation at 13.4G FLOPs versus 1.45-4.62% for competing approaches on ImageNet-1k