
Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

arXiv – CS AI | Bole Ma, Jan Eitzinger, Harald Köstler, Gerhard Wellein
🤖 AI Summary

Researchers present a cost model for optimizing cross-GPU attention operations in large language models, finding that routing queries is often cheaper than moving cache blocks when models are distributed across multiple nodes. The work applies to sparse-attention architectures like those in DeepSeek and GLM models, offering practical guidance for inference optimization on multi-node clusters.

Analysis

Modern large language models increasingly rely on sparse attention mechanisms where queries selectively attend to specific key-value cache blocks rather than all tokens. As model sizes exceed single-GPU capacity, these blocks get partitioned across multiple GPUs, creating a fundamental architectural challenge: when a query needs to attend to blocks on a different node, systems must decide whether to move the cache to the query or move the query to the cache. This paper investigates that tradeoff empirically on real hardware.

The key insight is that Multi-head Latent Attention (MLA) and similar compression techniques shrink a routed query to a representation of roughly 1 KB, so sending the query over the network is cheaper than gathering and moving the much larger cache blocks. The authors developed a topology-aware cost model that accounts for probe latency, transfer time, compute, the return path, and the merge step. On an H100 cluster with InfiniBand, the model's predictions came within about 7% of measured behavior, and the authors derived a closed-form predicate for choosing between routing and fetching.
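The route-versus-fetch decision can be sketched as a comparison of two summed costs. This is a minimal illustration and not the authors' model: the summary does not reproduce the paper's formula or its closed-form predicate, so the term structure, the hypothetical `Link` class, the function names, and every number in the example are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Hypothetical link description; both fields are illustrative assumptions."""
    probe_latency_s: float  # fixed per-message latency, in seconds
    bandwidth_Bps: float    # sustained bandwidth, in bytes per second

    def transfer_s(self, nbytes: float) -> float:
        # Assumed model for one message: latency + size / bandwidth.
        return self.probe_latency_s + nbytes / self.bandwidth_Bps


def route_cost_s(link, query_bytes, result_bytes, compute_s, merge_s):
    """Move the query: send it, attend remotely, return a partial result, merge."""
    return (link.transfer_s(query_bytes) + compute_s
            + link.transfer_s(result_bytes) + merge_s)


def fetch_cost_s(link, cache_bytes, compute_s):
    """Move the cache: pull the needed blocks, then attend locally."""
    return link.transfer_s(cache_bytes) + compute_s


def should_route(link, query_bytes, result_bytes, cache_bytes, compute_s, merge_s):
    """Return True when routing the query is estimated to be cheaper than fetching."""
    route = route_cost_s(link, query_bytes, result_bytes, compute_s, merge_s)
    fetch = fetch_cost_s(link, cache_bytes, compute_s)
    return route < fetch


# Illustrative values only (not measurements from the paper).
link = Link(probe_latency_s=5e-6, bandwidth_Bps=25e9)
large_cache = should_route(link, query_bytes=1024, result_bytes=1024,
                           cache_bytes=4 * 1024 * 1024,
                           compute_s=20e-6, merge_s=2e-6)
small_cache = should_route(link, query_bytes=1024, result_bytes=1024,
                           cache_bytes=2048,
                           compute_s=20e-6, merge_s=2e-6)
print(large_cache, small_cache)
```

With these assumed numbers, routing wins when the selected cache blocks total 4 MiB, while fetching wins when they total only 2 KB, because routing pays the per-message latency twice (query out, partial result back) plus a merge.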

For AI infrastructure, this work addresses a pain point in distributed inference. As agentic workloads proliferate, with multiple sub-agents querying shared code repositories, inference efficiency directly affects operating cost and the responsiveness of latency-sensitive applications. Because the cost model is not tied to MLA and applies to any sparse-attention system, it is relevant to a broad range of production deployments.

Practitioners deploying large models across clusters now have quantitative guidance for choosing between routing and fetching. The research suggests that network topology and probe latency matter more than peak bandwidth, which could shift how operators provision multi-node inference infrastructure and select networking hardware.
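A back-of-the-envelope calculation shows why latency can outweigh bandwidth for a routed query of roughly 1 KB. The sketch assumes a simple latency-plus-size-over-bandwidth transfer model; the 5 µs latency and both bandwidth figures are hypothetical values, not measurements from the paper.

```python
def transfer_time_s(nbytes, latency_s, bandwidth_Bps):
    """Assumed transfer model: fixed latency plus size divided by bandwidth."""
    return latency_s + nbytes / bandwidth_Bps


QUERY_BYTES = 1024   # ~1 KB routed query, as described in the summary
LATENCY_S = 5e-6     # hypothetical per-message latency
BASE_BW = 25e9       # hypothetical baseline bandwidth, bytes per second

baseline = transfer_time_s(QUERY_BYTES, LATENCY_S, BASE_BW)
double_bandwidth = transfer_time_s(QUERY_BYTES, LATENCY_S, 2 * BASE_BW)
half_latency = transfer_time_s(QUERY_BYTES, LATENCY_S / 2, BASE_BW)

# Fractional reduction in transfer time from each hardware change.
bandwidth_gain = 1 - double_bandwidth / baseline
latency_gain = 1 - half_latency / baseline

print(f"doubling bandwidth saves {bandwidth_gain:.1%}")
print(f"halving latency saves {latency_gain:.1%}")
```

Under these assumptions, doubling the bandwidth shortens the query transfer by less than 1%, while halving the latency shortens it by nearly 50%. For payloads this small the fixed latency term dominates.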

Key Takeaways
  • Routing small queries is often cheaper than moving large cache blocks in distributed attention systems
  • The proposed cost model predicts cross-instance attention performance to within about 7% of measurements on real hardware
  • Network probe latency matters more than peak bandwidth for deciding routing versus fetching strategies
  • The framework applies to multiple sparse-attention architectures including DeepSeek and GLM models
  • Adding new architectures requires measuring only two coefficients rather than rebuilding the entire model