#gpu-clusters News & Analysis

5 articles tagged with #gpu-clusters. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Apr 6 · 7/10

Glia: A Human-Inspired AI for Automated Systems Design and Optimization

Researchers have developed Glia, an AI architecture that uses large language models in a multi-agent workflow to autonomously design mechanisms for computer systems. The system generates interpretable designs for distributed GPU clusters that match human-expert performance and offer novel insights into workload behavior.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Move the Query, Not the Cache: Characterizing Cross-Instance Latent Attention Redistribution Across GPU Fabrics

Researchers present a cost model for optimizing cross-GPU attention operations in large language models, finding that routing queries is often cheaper than moving cache blocks when models are distributed across multiple nodes. The work applies to sparse-attention architectures like those in DeepSeek and GLM models, offering practical guidance for inference optimization on multi-node clusters.
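The intuition behind "move the query, not the cache" can be illustrated with a rough byte count. The sketch below is not the paper's cost model: the function names, the 2-byte element size, and the assumption that routing a query costs one send plus one same-sized partial output in return are all hypothetical simplifications chosen for illustration.

```python
def query_route_bytes(num_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes to ship one token's query to a remote instance and get a
    same-sized partial attention output back (assumed round trip)."""
    one_way = num_heads * head_dim * dtype_bytes
    return 2 * one_way


def cache_move_bytes(num_cached_tokens: int, entry_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes to copy the cached entries for num_cached_tokens tokens,
    assuming entry_dim elements are stored per token."""
    return num_cached_tokens * entry_dim * dtype_bytes


def prefer_query_routing(num_heads: int, head_dim: int,
                         num_cached_tokens: int, entry_dim: int,
                         dtype_bytes: int = 2) -> bool:
    """True when routing the query moves fewer bytes than moving the cache."""
    return (query_route_bytes(num_heads, head_dim, dtype_bytes)
            < cache_move_bytes(num_cached_tokens, entry_dim, dtype_bytes))


if __name__ == "__main__":
    # Hypothetical shapes: 16 heads of width 128, 512 cached elements per token.
    print(prefer_query_routing(16, 128, num_cached_tokens=4096, entry_dim=512))
    print(prefer_query_routing(16, 128, num_cached_tokens=4, entry_dim=512))
```

Under these assumed shapes the query round trip is a fixed cost while the cache transfer grows with the number of cached tokens, so query routing wins for long contexts and loses only when very few tokens are cached. A real cost model would also account for link latency, bandwidth, and compute placement, which this sketch ignores.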

AI · Neutral · arXiv – CS AI · May 7 · 6/10

Resilient AI Supercomputer Networking using MRC and SRv6

OpenAI and Microsoft have deployed MRC, a new RDMA-based transport protocol combined with SRv6 static routing, to eliminate tail latency issues in massive AI training clusters exceeding 100K GPUs. The system uses multi-plane Clos topologies and intelligent load-balancing to bypass network failures without interrupting synchronous training jobs, addressing a critical bottleneck in frontier model development.

🏢 OpenAI

AI · Neutral · arXiv – CS AI · May 4 · 6/10

Rethinking Network Topologies for Cost-Effective Mixture-of-Experts LLM Serving

Researchers challenge the necessity of expensive high-bandwidth networks for Mixture-of-Experts LLM serving, demonstrating that lower-cost switchless topologies deliver 20.6% to 56.2% better cost-effectiveness than industry-standard scale-up architectures. The analysis suggests that current network infrastructure is over-provisioned, with implications for data center economics and AI deployment efficiency.

AI · Neutral · OpenAI News · Jun 9 · 5/10 · 8

Techniques for training large neural networks

Large neural networks are driving recent AI advances but are difficult to train: the work must be spread across a cluster of GPUs that perform a single synchronized calculation. Orchestrating those distributed computing resources remains a key engineering obstacle in scaling AI systems.