AINeutralarXiv – CS AI · 7h ago6/10
🧠
Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics
Researchers present a cost model for optimizing cross-GPU attention operations in large language models, finding that routing queries is often cheaper than moving cache blocks when models are distributed across multiple nodes. The work applies to sparse-attention architectures like those in DeepSeek and GLM models, offering practical guidance for inference optimization on multi-node clusters.