JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search
Researchers introduce JetViT, a hybrid Vision Transformer architecture that matches the accuracy of state-of-the-art models while delivering up to 1.79x higher throughput and 44.81% lower latency on high-resolution images. Its core technique, post-training attention search, converts full-attention models into efficient hybrid variants by replacing redundant attention blocks.
JetViT addresses a critical bottleneck in deploying large vision models: processing high-resolution images remains prohibitively expensive despite advances in model accuracy. The research shows that the attention blocks in a Vision Transformer do not contribute equally to output quality, so computationally expensive full-attention blocks can be selectively replaced with more efficient linear-attention or window-attention variants. This post-training approach is particularly valuable because it preserves the weights learned by the pre-trained model, avoiding costly retraining cycles.
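To see why full attention dominates the cost at high resolution, the back-of-the-envelope sketch below compares the token-mixing cost of the three block types. It is an illustration rather than the paper's cost model: the function names, the 16-pixel patch size, the 768-wide embedding with 64-dimensional heads, and the 196-token window are all assumptions, and projections, softmax, and constant factors are ignored.

```python
def num_tokens(image_side: int, patch_size: int = 16) -> int:
    """Number of patch tokens for a square image (assumed 16-pixel patches)."""
    return (image_side // patch_size) ** 2


def mixing_cost(tokens: int, dim: int = 768, head_dim: int = 64,
                kind: str = "full", window: int = 196) -> int:
    """Rough multiply-accumulate count for the token-mixing step of one block.

    All sizes are illustrative assumptions, not values from the JetViT paper.
    """
    if kind == "full":
        # Every token attends to every other token: quadratic in tokens.
        return 2 * tokens * tokens * dim
    if kind == "window":
        # Each token attends only to the tokens inside its local window.
        return 2 * tokens * min(window, tokens) * dim
    if kind == "linear":
        # Kernelized attention builds a head_dim x head_dim summary per head.
        return 2 * tokens * head_dim * dim
    raise ValueError(f"unknown attention kind: {kind}")


if __name__ == "__main__":
    for side in (224, 1024, 2048):
        n = num_tokens(side)
        full = mixing_cost(n, kind="full")
        for kind in ("window", "linear"):
            ratio = full / mixing_cost(n, kind=kind)
            print(f"{side}px ({n} tokens): full is {ratio:.1f}x the {kind} cost")
```

Under these assumptions, the full-attention cost grows quadratically with the token count while the window and linear variants grow linearly, so the gap widens as resolution increases.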
The work builds on broader trends in efficient AI architecture design. As foundation models grow larger and more capable, practitioners face mounting pressure to reduce inference costs without sacrificing performance. Prior work explored efficient attention mechanisms in isolation, whereas JetViT uses automated search to identify which attention blocks actually matter for a given vision task. Tests on DINOv3 and DepthAnythingV2 validate the approach across different foundation model families and tasks.
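One simple way to picture such a search is a greedy pass over the blocks that keeps a cheaper attention type wherever accuracy stays within a budget. This summary does not describe JetViT's actual search procedure, so the sketch below is a hypothetical heuristic and not the authors' method; `greedy_attention_search`, `toy_evaluate`, the accuracy numbers, and the set of "important" blocks are all invented for illustration.

```python
from typing import Callable, List, Sequence


def greedy_attention_search(
    num_blocks: int,
    evaluate: Callable[[List[str]], float],
    max_drop: float,
    candidates: Sequence[str] = ("linear", "window"),
) -> List[str]:
    """Greedily swap full-attention blocks for cheaper ones (hypothetical heuristic).

    `evaluate` scores a per-block configuration, e.g. validation accuracy of the
    pre-trained model with those blocks swapped and no retraining.
    """
    config = ["full"] * num_blocks
    baseline = evaluate(config)
    for i in range(num_blocks):
        for kind in candidates:
            trial = config.copy()
            trial[i] = kind
            # Keep the swap only if total accuracy loss stays within budget.
            if baseline - evaluate(trial) <= max_drop:
                config = trial
                break
    return config


def toy_evaluate(config: List[str]) -> float:
    """Stand-in scorer: pretend blocks 3, 7 and 11 are the ones that matter."""
    important = {3, 7, 11}
    penalty = 0.0
    for i, kind in enumerate(config):
        if kind != "full":
            penalty += 1.0 if i in important else 0.01
    return 80.0 - penalty


if __name__ == "__main__":
    hybrid = greedy_attention_search(12, toy_evaluate, max_drop=0.5)
    print(hybrid)
```

In a real setting `evaluate` would run the modified pre-trained model on held-out data, which is what makes the search post-training: the weights stay fixed and only the per-block attention type changes.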
The practical implications extend across computer vision applications where latency and throughput directly affect real-world deployment. Industries that depend on high-resolution image processing, such as medical imaging, autonomous systems, and remote sensing, stand to benefit from reduced computational requirements. The ability to accelerate existing models without retraining lowers the barrier for organizations with limited GPU resources. As enterprises deploy vision models at scale, efficiency improvements translate directly into lower infrastructure costs and faster inference for end users, potentially broadening the use of vision AI in enterprise and edge computing scenarios.
- JetViT achieves up to 1.79x higher throughput and 44.81% lower latency on H100 GPUs without accuracy loss
- Post-training attention search selectively replaces full-attention blocks with linear or window-attention alternatives
- The method preserves learned weights from pre-trained models, eliminating expensive retraining
- Efficiency gains directly reduce computational and infrastructure costs for high-resolution vision model deployment
- The approach generalizes across different vision foundation models, suggesting broad applicability to existing architectures