HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling
Researchers introduce HiDe, a training-free framework that improves the performance of Multimodal Large Language Models (MLLMs) on high-resolution images. Its central finding is that background interference, not object size, is the primary limitation. The method uses Token-wise Attention Decoupling and Layout-Preserving Decoupling to achieve state-of-the-art results on multiple benchmarks while using 75% less memory than previous training-free approaches.
The research challenges a widely held assumption in the MLLM community about why these models struggle with high-resolution images. Rather than confirming that small-object recognition drives the performance degradation, HiDe demonstrates that complex background interference is the actual bottleneck. This reframing matters because it redirects engineering effort away from traditional zoom-in strategies and toward fundamentally different solutions.
The technical approach employs two decoupling stages. Token-wise Attention Decoupling isolates question-relevant tokens from visual noise, and Layout-Preserving Decoupling removes background information while maintaining the spatial relationships among the remaining regions. Critically, neither stage requires retraining, so the framework is immediately applicable to existing MLLM architectures. It is compatible with multiple model families: it achieves state-of-the-art performance on V*Bench (92.1% with Qwen2.5-VL 7B, 91.6% with InternVL3 8B) and on the specialized high-resolution benchmarks HRBench4K and HRBench8K.
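The two stages can be pictured with a toy example. The sketch below is an illustration only and not the authors' algorithm: it assumes a small grid of question-to-patch attention scores is already available, keeps the highest-scoring patches, and then drops rows and columns that contain no kept patch so that the survivors retain their relative positions. The function names `select_relevant` and `compact_layout`, the `keep_ratio` parameter, and the score values are all hypothetical.

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]
Mask = List[List[bool]]


def select_relevant(attn: Grid, keep_ratio: float = 0.25) -> Mask:
    """Mark patches whose attention score is among the top `keep_ratio` share.

    Assumption: `attn[r][c]` is a question-to-patch attention score.
    Ties at the threshold are all kept, so slightly more patches may survive.
    """
    flat = sorted((v for row in attn for v in row), reverse=True)
    k = max(1, int(len(flat) * keep_ratio))
    threshold = flat[k - 1]
    return [[v >= threshold for v in row] for row in attn]


def compact_layout(mask: Mask) -> List[List[Optional[Tuple[int, int]]]]:
    """Drop rows and columns with no selected patch, preserving order.

    Each cell holds the original (row, col) of a kept patch, or None where
    a background patch was removed inside a surviving row and column.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return [[(r, c) if mask[r][c] else None for c in cols] for r in rows]


# Hypothetical 4x4 attention grid with two question-relevant patches.
attn = [
    [0.01, 0.02, 0.01, 0.01],
    [0.01, 0.30, 0.02, 0.01],
    [0.01, 0.01, 0.01, 0.01],
    [0.02, 0.01, 0.01, 0.25],
]

mask = select_relevant(attn, keep_ratio=0.125)  # keep the top 2 of 16 patches
compact = compact_layout(mask)
print(compact)  # [[(1, 1), None], [None, (3, 3)]]
```

In this toy case the 4x4 grid shrinks to 2x2, and the patch from the upper left still sits above and to the left of the patch from the lower right. That is the sense in which a compacted view can discard background while preserving layout.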
For developers deploying MLLMs in production, the 75% memory reduction relative to previous training-free approaches has substantial practical value: resource-constrained applications can process high-resolution imagery more efficiently. Because the framework is training-free, existing models can benefit without expensive fine-tuning cycles. As organizations increasingly demand MLLMs that handle document analysis, satellite imagery, and medical diagnostics, all domains that require high-resolution processing, this framework addresses an operational bottleneck that has limited real-world MLLM deployment at scale.
- HiDe identifies background interference, rather than object size, as the primary limitation in MLLM high-resolution performance
- The training-free framework achieves SOTA results on multiple benchmarks without requiring model retraining
- Memory usage is reduced by 75% compared to previous training-free approaches while performance improves
- Token-wise Attention Decoupling and Layout-Preserving Decoupling can be applied to existing MLLM architectures across different model families
- The framework enables practical deployment of high-resolution image processing in resource-constrained environments