#distributed-inference News & Analysis

8 articles tagged with #distributed-inference. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Leyline: KV Cache Directives for Agentic Inference

Leyline introduces a new serving-side primitive for managing KV cache in agentic LLMs, enabling efficient content editing and removal without full re-computation. The system uses declarative directives and RoPE-rotation corrections to handle policy-driven cache modifications, improving cache efficiency by 11.2 percentage points and agent solve rates by 14.3 percentage points.
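
The "RoPE-rotation corrections" mentioned above can be illustrated with a general property of rotary position embeddings: rotations compose, so a key cached at one position can be moved to another by applying one extra rotation instead of recomputing it. The sketch below is not Leyline's implementation; the function names (`rope_angles`, `rotate`, `shift_cached_key`) and the interleaved-pair layout are assumptions chosen for illustration.

```python
import numpy as np

def rope_angles(position, head_dim, base=10000.0):
    # One rotation angle per (even, odd) dimension pair, as in standard RoPE.
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return position * inv_freq

def rotate(vec, angles):
    # Rotate each (even, odd) pair of `vec` by the matching angle.
    x, y = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = x * np.cos(angles) - y * np.sin(angles)
    out[1::2] = x * np.sin(angles) + y * np.cos(angles)
    return out

def shift_cached_key(cached_key, shift, base=10000.0):
    # Hypothetical helper: re-position an already-rotated key by `shift`
    # positions (negative moves it earlier) without touching the raw key.
    return rotate(cached_key, rope_angles(shift, cached_key.shape[-1], base))

rng = np.random.default_rng(0)
raw_key = rng.standard_normal(8)

cached = rotate(raw_key, rope_angles(10, 8))    # key cached at position 10
corrected = shift_cached_key(cached, -3)        # three earlier tokens removed
expected = rotate(raw_key, rope_angles(7, 8))   # full recompute at position 7

print(np.allclose(corrected, expected))
```

Under these assumptions, removing content ahead of a cached key only costs one extra rotation per key rather than a new forward pass.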

AI · Bullish · arXiv – CS AI · May 27 · 7/10

StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

StreamSplit introduces a novel framework enabling continuous contrastive learning on edge devices by dynamically partitioning computation between local and cloud resources. Using reinforcement learning and uncertainty guidance, the system reduces latency by up to 4.7x and bandwidth by 77.1% while maintaining near-server accuracy, making distributed AI inference practical for resource-constrained hardware.

AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 5

Federated Inference: Toward Privacy-Preserving Collaborative and Incentivized Model Serving

Researchers introduce Federated Inference (FI), a new collaborative paradigm where independently trained AI models can work together at inference time without sharing data or model parameters. The study identifies key requirements including privacy preservation and performance gains, while highlighting system-level challenges that differ from traditional federated learning approaches.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Collaborative Edge-to-Server Inference for Vision-Language Models

Researchers propose a collaborative edge-to-server inference framework for vision-language models that reduces communication costs by selectively transmitting only high-entropy regions of interest rather than full-resolution images. The two-stage approach maintains inference accuracy while substantially decreasing bandwidth requirements across visual question-answering tasks.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

Researchers propose Task-Aware Coactivation Grouping (TACG), a framework for optimizing Mixture-of-Experts (MoE) model inference across distributed GPUs by grouping experts based on task-specific activation patterns rather than global averages. The approach reduces communication costs by 31.39% while maintaining load balance, addressing a critical efficiency bottleneck in multi-task AI serving.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

Researchers present a cost model for optimizing cross-GPU attention operations in large language models, finding that routing queries is often cheaper than moving cache blocks when models are distributed across multiple nodes. The work applies to sparse-attention architectures like those in DeepSeek and GLM models, offering practical guidance for inference optimization on multi-node clusters.
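
To make the "move the query, not the cache" trade-off concrete, here is a back-of-the-envelope byte count. It is an illustrative assumption, not the paper's cost model: it counts only bytes transferred, assumes 2-byte elements, and assumes the remote side returns a partial attention output plus one log-sum-exp scalar per head for merging. All function and parameter names are hypothetical.

```python
def bytes_to_move_cache(num_cached_tokens, kv_dim, bytes_per_elem=2):
    # Ship the cached entries to the GPU that holds the query.
    return num_cached_tokens * kv_dim * bytes_per_elem

def bytes_to_move_query(num_query_tokens, num_heads, head_dim, bytes_per_elem=2):
    # Ship the queries to the GPU that holds the cache, then receive the
    # partial outputs plus one log-sum-exp scalar per head (assumption).
    send = num_query_tokens * num_heads * head_dim * bytes_per_elem
    recv = num_query_tokens * num_heads * (head_dim + 1) * bytes_per_elem
    return send + recv

def cheaper_option(num_cached_tokens, kv_dim, num_query_tokens, num_heads, head_dim):
    cache_cost = bytes_to_move_cache(num_cached_tokens, kv_dim)
    query_cost = bytes_to_move_query(num_query_tokens, num_heads, head_dim)
    return ("query" if query_cost < cache_cost else "cache"), query_cost, cache_cost

# Decode step: one query token against 4096 cached tokens.
print(cheaper_option(4096, 512, 1, 32, 128))
```

With these made-up sizes, a single decode-step query is far smaller than the cache block it attends to; the balance shifts as the number of query tokens grows or the cached span shrinks.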

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum

Researchers propose an adaptive framework for dynamically partitioning deep neural networks across edge-cloud infrastructure, addressing limitations of static approaches. Testing on real hardware demonstrates 27-35% energy reductions and 6-23% latency improvements compared to static baselines, validating the effectiveness of runtime-adaptive strategies for heterogeneous computing environments.
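
A minimal sketch of why a runtime-adaptive split can beat a static one, assuming a simple latency-only model (edge compute, then transfer, then cloud compute). This is not the paper's framework, which also targets energy; the per-layer timings, activation sizes, and the `best_split` helper are invented for illustration.

```python
def best_split(edge_ms, cloud_ms, boundary_bytes, bandwidth_bytes_per_ms):
    # Layers [0, k) run on the edge device, layers [k, n) run in the cloud.
    # boundary_bytes[k] is the data sent when splitting at k:
    # index 0 is the raw input, index n is 0 (nothing offloaded).
    n = len(edge_ms)
    best_k, best_cost = 0, float("inf")
    for k in range(n + 1):
        cost = (sum(edge_ms[:k])
                + boundary_bytes[k] / bandwidth_bytes_per_ms
                + sum(cloud_ms[k:]))
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost

# Hypothetical three-layer network.
edge_ms = [4, 6, 10]
cloud_ms = [1, 1, 2]
boundary_bytes = [600_000, 200_000, 50_000, 0]

print(best_split(edge_ms, cloud_ms, boundary_bytes, 100_000))  # fast link
print(best_split(edge_ms, cloud_ms, boundary_bytes, 5_000))    # slow link
```

In this toy setup the best split moves from after the first layer on the fast link to fully on-device on the slow link, which is the kind of shift a static partition cannot follow.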

AI · Bullish · arXiv – CS AI · May 4 · 6/10

Space Network of Experts: Architecture and Expert Placement

Researchers present Space-XNet, a framework for efficiently deploying mixture-of-experts language models across satellite constellations using optimized expert placement strategies. The approach achieves a threefold latency reduction compared to conventional methods, addressing key challenges in executing energy-intensive AI workloads in space where computing and communication resources are severely constrained.