STaR-KV: Spatio-Temporal Adaptive Re-weighting for KV Cache Compression in GUI Vision-Language Models
Researchers introduce STaR-KV, a training-free key-value (KV) cache compression framework that cuts peak GPU memory in vision-language GUI agents by up to 40% while maintaining accuracy. The method addresses a critical bottleneck: models like UI-TARS-1.5-7B consume prohibitive amounts of GPU memory during multi-step interactions. Compressing the cache makes deployment on standard accelerators more practical.
Vision-language models deployed as GUI agents face a fundamental scalability challenge: their key-value caches grow linearly with each interaction step, rapidly exhausting GPU memory even on 80 GB accelerators. STaR-KV tackles this constraint with a three-axis compression approach that departs from the rigid assumptions of prior methods. Where existing solutions treat all attention heads identically and apply fixed cutoff thresholds, STaR-KV recognizes that spatial importance patterns vary across attention subspaces and shift dynamically across layers.
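The linear growth is easy to see with back-of-the-envelope arithmetic. The sketch below is illustrative only: the layer count, KV head count, head dimension, and tokens-per-screenshot figures are placeholder assumptions, not measured values for UI-TARS-1.5-7B, and it counts only the cache itself (no weights or activations), so it does not attempt to reproduce the memory figures reported for STaR-KV.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of an uncompressed KV cache for a single sequence.

    The leading factor 2 accounts for storing both a key and a value
    vector per token, per layer, per KV head. bytes_per_value=2
    corresponds to 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def cache_after_steps(steps, tokens_per_screenshot, **model):
    """Cache size after each interaction step, assuming every step
    appends one screenshot's worth of visual tokens and nothing is evicted."""
    return [
        kv_cache_bytes(step * tokens_per_screenshot, **model)
        for step in range(1, steps + 1)
    ]


if __name__ == "__main__":
    # Hypothetical model shape and screenshot size, chosen for illustration.
    sizes = cache_after_steps(
        steps=5,
        tokens_per_screenshot=2500,
        num_layers=28,
        num_kv_heads=4,
        head_dim=128,
    )
    for step, size in enumerate(sizes, start=1):
        print(f"step {step}: {size / 2**20:.1f} MiB of KV cache")
```

Each additional screenshot adds the same fixed amount, which is why an agent that never evicts entries eventually runs out of memory regardless of how large the accelerator is.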
The framework introduces three technical innovations: subspace-aware scoring using online spatial mutual information, temporal stability discounting to eliminate redundant entries from repeatedly attended regions, and entropy-derived temperature adjustment for adaptive score distribution. This granular approach reflects a deeper understanding of transformer attention mechanics, in which different heads specialize in different features and those specializations evolve with depth.
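To make the interplay of these three ideas concrete, here is a minimal, heavily simplified sketch of score-based cache selection. It is not the STaR-KV algorithm: the summary above does not give the actual formulas, so the per-head scores are taken as given inputs (standing in for the mutual-information scoring), and the temperature rule, the exponential discount, and every name and parameter (`select_kv_entries`, `stable_steps`, `keep_ratio`, `decay`) are assumptions made purely for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_kv_entries(head_scores, stable_steps, keep_ratio=0.6, decay=0.5):
    """Return the sorted indices of cached tokens to keep.

    head_scores:  one list of raw importance scores per attention head,
                  each of length n (one score per cached token).
    stable_steps: for each token, how many earlier steps its screen
                  region was already attended (hypothetical signal).
    """
    n = len(stable_steps)
    combined = [0.0] * n
    for scores in head_scores:
        # Per-head treatment: each head gets its own temperature.
        probs = softmax(scores)
        norm_entropy = entropy(probs) / math.log(n) if n > 1 else 0.0
        # Assumed rule: flatter (higher-entropy) heads are sharpened more.
        temperature = 1.0 - 0.5 * norm_entropy
        for i, p in enumerate(softmax(scores, temperature)):
            combined[i] += p / len(head_scores)

    # Temporal discount: repeatedly attended regions lose weight.
    discounted = [c * decay ** k for c, k in zip(combined, stable_steps)]

    budget = max(1, math.ceil(keep_ratio * n))
    ranked = sorted(range(n), key=lambda i: discounted[i], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    head_scores = [[2.0, 0.1, 0.1, 1.5], [0.2, 0.1, 3.0, 1.0]]
    # Token 0 has been attended for three earlier steps, so it is discounted.
    print(select_kv_entries(head_scores, [3, 0, 0, 0], keep_ratio=0.5))
    print(select_kv_entries(head_scores, [0, 0, 0, 0], keep_ratio=0.5))
```

In this toy example, token 0 survives when nothing is discounted but is evicted once it has been marked as stable for three steps, which is the qualitative behavior the temporal discounting is described as providing.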
The practical implications extend beyond academic interest. GUI automation is a high-value application domain for AI agents, covering workflows across enterprise software, web applications, and system interfaces. Current memory constraints create friction in commercial deployment; reducing peak GPU memory consumption by up to 40% while maintaining benchmark performance addresses a real production bottleneck. Because compression adds negligible computational overhead and requires no retraining, STaR-KV is a viable retrofit for existing models.
Longer-term, this research signals emerging sophistication in post-training optimization. As foundation models proliferate across specialized domains, inference-stage compression techniques become increasingly valuable for cost-sensitive deployment. The trajectory suggests future architectural designs may natively incorporate adaptive caching mechanisms rather than treating compression as an afterthought.
- STaR-KV reduces peak GPU memory by up to 40% on GUI agent tasks while matching or exceeding the accuracy of existing KV compression methods.
- The framework abandons the single-saliency-map assumption, instead using subspace-aware scoring that adapts across attention heads and layers.
- Temporal stability discounting suppresses redundant cache entries from persistently attended regions, directly addressing linear cache growth in multi-step interactions.
- Training-free deployment with negligible computational overhead (-0.07% FLOPs) enables immediate adoption on existing vision-language models.
- GUI automation workloads that consume 76 GB for five screenshots become practical on standard 80 GB accelerators, reducing deployment friction in enterprise environments.