Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics
Researchers present a cost model for optimizing cross-GPU attention operations in large language models, finding that routing queries is often cheaper than moving cache blocks when models are distributed across multiple nodes. The work applies to sparse-attention architectures like those in DeepSeek and GLM models, offering practical guidance for inference optimization on multi-node clusters.
Modern large language models increasingly rely on sparse attention mechanisms where queries selectively attend to specific key-value cache blocks rather than all tokens. As model sizes exceed single-GPU capacity, these blocks get partitioned across multiple GPUs, creating a fundamental architectural challenge: when a query needs to attend to blocks on a different node, systems must decide whether to move the cache to the query or move the query to the cache. This paper investigates that tradeoff empirically on real hardware.
The key insight is that Multi-head Latent Attention (MLA) and similar compression techniques shrink a routed query to a representation of roughly 1 KB, so sending the query over the network is often cheaper than gathering and moving the much larger cache blocks. The authors developed a topology-aware cost model that accounts for probe latency, transfer time, compute, the return path, and the merge step. On an H100 cluster with InfiniBand, the model's predictions came within 7% of measured behavior, and the authors derived a closed-form predicate for choosing between routing and fetching.
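The route-versus-fetch tradeoff can be sketched as an additive cost comparison. The Python below is a minimal illustration, not the authors' model or their closed-form predicate. The `Link` type, the function names, the additive form, and every numeric value (5 µs per-message latency, 50 GB/s bandwidth, a 4 MiB cache block, 10 µs of attention compute) are assumptions chosen only to show the shape of the decision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical description of one GPU-to-GPU path (not from the paper)."""
    probe_latency_s: float  # assumed fixed cost per message, in seconds
    bandwidth_Bps: float    # assumed sustained bandwidth, in bytes per second


def transfer_time(link: Link, nbytes: float) -> float:
    """Fixed latency plus serialization time for one message."""
    return link.probe_latency_s + nbytes / link.bandwidth_Bps


def route_cost(link: Link, query_bytes: float, result_bytes: float,
               remote_compute_s: float, merge_s: float) -> float:
    """Send the query, compute remotely, return the partial result, merge."""
    return (transfer_time(link, query_bytes)
            + remote_compute_s
            + transfer_time(link, result_bytes)
            + merge_s)


def fetch_cost(link: Link, cache_bytes: float, local_compute_s: float) -> float:
    """Pull the cache blocks to the query's GPU, then compute locally."""
    return transfer_time(link, cache_bytes) + local_compute_s


def should_route(link: Link, query_bytes: float, result_bytes: float,
                 cache_bytes: float, remote_compute_s: float,
                 local_compute_s: float, merge_s: float) -> bool:
    """True when routing the query is estimated to be cheaper than fetching."""
    return (route_cost(link, query_bytes, result_bytes, remote_compute_s, merge_s)
            < fetch_cost(link, cache_bytes, local_compute_s))


# Illustrative values only; none of these numbers come from the paper.
example_link = Link(probe_latency_s=5e-6, bandwidth_Bps=50e9)
decision = should_route(
    example_link,
    query_bytes=1024,              # ~1 KB compressed query
    result_bytes=1024,             # assumed size of the returned partial result
    cache_bytes=4 * 1024 * 1024,   # assumed 4 MiB of cache blocks
    remote_compute_s=10e-6,
    local_compute_s=10e-6,
    merge_s=1e-6,
)
print("route the query" if decision else "fetch the cache")
```

Under these assumed numbers routing wins because the cache transfer dominates the fetch path. With a cache block smaller than the query, the two extra steps on the routing path (return and merge) tip the comparison the other way.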
For AI infrastructure, the work addresses a concrete cost in distributed inference. As agentic workloads proliferate, with multiple sub-agents querying shared code repositories, inference efficiency directly affects both operating costs and the responsiveness of latency-sensitive applications. The cost model is not tied to MLA: it is presented as applicable to sparse-attention systems in general, which makes it relevant to production deployments.
Practitioners deploying large models across clusters now have quantitative guidance for optimization. The research suggests that network topology and probe latency matter more than peak bandwidth, potentially shifting how operators provision multi-node inference infrastructure and select networking hardware.
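A rough calculation shows why per-message latency can outweigh peak bandwidth for messages as small as a routed query. The sketch below assumes a simple `latency + size / bandwidth` timing model and treats probe latency as a fixed per-message cost. The function name and the numbers (5 µs latency, 50 GB/s bandwidth) are illustrative assumptions, not measurements from the paper.

```python
def message_time(latency_s: float, nbytes: float, bandwidth_Bps: float) -> float:
    """Assumed timing model: fixed latency plus serialization time."""
    return latency_s + nbytes / bandwidth_Bps


QUERY_BYTES = 1024  # ~1 KB routed query

baseline = message_time(5e-6, QUERY_BYTES, 50e9)
doubled_bandwidth = message_time(5e-6, QUERY_BYTES, 100e9)
halved_latency = message_time(2.5e-6, QUERY_BYTES, 50e9)

# Time saved per message by each hypothetical upgrade.
bandwidth_saving = baseline - doubled_bandwidth
latency_saving = baseline - halved_latency

print(f"saved by doubling bandwidth: {bandwidth_saving * 1e9:.2f} ns")
print(f"saved by halving latency:    {latency_saving * 1e9:.2f} ns")
```

With a 1 KB payload, serialization takes tens of nanoseconds while the fixed latency is measured in microseconds, so in this toy model halving latency saves over a hundred times more than doubling bandwidth.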
- Routing small queries is often cheaper than moving large cache blocks in distributed attention systems
- The proposed cost model predicts cross-instance attention performance to within about 7% of measurements on real hardware
- Network probe latency matters more than peak bandwidth when deciding between routing and fetching
- The framework applies to multiple sparse-attention architectures, including those in DeepSeek and GLM models
- Adding a new architecture requires measuring only two coefficients rather than rebuilding the entire model