#distributed-training News & Analysis

45 articles tagged with #distributed-training. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

LAYUP: Asynchronous decentralized gradient descent with LAYer-wise UPdates

Researchers present LayUp, an asynchronous decentralized gradient descent algorithm that enables faster distributed training of deep learning models through layer-wise updates and gossip-based communication. The method demonstrates 32% faster convergence than synchronous training while maintaining robustness to stragglers and requiring no extra buffering.
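As a rough illustration of the gossip-based communication mentioned above, the sketch below runs generic pairwise gossip averaging on a ring of workers. It is not LayUp's layer-wise algorithm: the ring topology, the scalar "parameters", and the names `gossip_round` and `run_gossip` are assumptions made for this example only.

```python
import random


def gossip_round(values, neighbors, rng):
    """One asynchronous step: a random worker averages with a random neighbor."""
    i = rng.randrange(len(values))
    j = rng.choice(neighbors[i])
    mean = (values[i] + values[j]) / 2.0
    values[i] = mean
    values[j] = mean


def run_gossip(values, neighbors, steps, seed=0):
    """Run `steps` pairwise exchanges and return the workers' final values."""
    rng = random.Random(seed)
    values = list(values)  # copy so the caller's list is left untouched
    for _ in range(steps):
        gossip_round(values, neighbors, rng)
    return values


# Hypothetical setup: 4 workers on a ring, each holding one scalar parameter.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
start = [0.0, 4.0, 8.0, 12.0]
final = run_gossip(start, ring, steps=200)
print(final)  # every worker ends up close to the global mean of the start values
```

No worker waits for a global barrier here: each exchange touches only two workers, which is the property that makes gossip schemes tolerant of stragglers.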

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

FoMoE introduces a distributed training system that breaks the full-model replication requirement in Mixture-of-Experts (MoE) architectures by partitioning experts across workers. The approach achieves up to 1.42x communication cost reduction and 45x improvement over traditional distributed training, enabling efficient LLM pre-training across geographically dispersed commodity hardware.

AI · Bullish · arXiv – CS AI · Jun 10 · 7/10

Piper: A Programmable Distributed Training System

Piper is a new distributed training system that separates strategy design from runtime implementation, allowing researchers to compose multiple parallelism strategies flexibly without manual reconfiguration. The system maintains performance parity with existing approaches like ZeRO while enabling efficiency gains through joint optimization of computation and communication in complex training scenarios.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Resource-aware Computation-Communication Overlap for multi-GPU ML Workloads

Researchers have developed a method to improve multi-GPU machine learning training by enabling computation and communication to execute simultaneously using shared-memory allocation and scheduling priority adjustments. The technique demonstrates up to 25.5% execution time reduction across NVIDIA and AMD GPUs without requiring modifications to vendor libraries.

🏢 Nvidia

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Multilingual Fine-Tuning via Localized Gradient Conflict Resolution

Researchers introduce Bucket-Level MOO, a distributed framework that addresses negative interference when fine-tuning Large Language Models across multiple languages by reformulating the problem as multi-objective optimization. The method enables conflict-aware parameter updates without excessive communication overhead while theoretically guaranteeing Refined Pareto Stationarity, improving multilingual performance across four LLM architectures.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Model Parallelism With Subnetwork Data Parallelism

Researchers introduce Subnetwork Data Parallelism (SDP), a distributed training framework that reduces memory consumption by 28-60% during neural network pre-training by partitioning models into structured subnetworks trained across workers without exchanging activations. The method supports both backward and forward masking regimes and maintains or improves performance across transformer and CNN architectures.
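One way to picture training subnetworks across workers is a masked merge: each worker updates only the parameters inside its mask, and each parameter is averaged over the workers that own it. The summary does not state SDP's actual aggregation rule, so the rule below, the function name `masked_average`, and the flat-list parameter layout are all assumptions for illustration.

```python
def masked_average(worker_params, worker_masks, previous):
    """Merge per-worker parameters, averaging each entry over its owners.

    worker_params: one flat parameter list per worker
    worker_masks:  one 0/1 list per worker marking the entries it trained
    previous:      the last merged parameters, kept where no worker owns an entry
    """
    merged = []
    for k in range(len(previous)):
        owners = [w for w, mask in enumerate(worker_masks) if mask[k]]
        if owners:
            total = sum(worker_params[w][k] for w in owners)
            merged.append(total / len(owners))
        else:
            merged.append(previous[k])  # untouched this round
    return merged


# Hypothetical example: two workers with overlapping subnetworks.
masks = [[1, 1, 0, 0], [0, 1, 1, 0]]
params = [[2.0, 4.0, 9.0, 9.0], [9.0, 6.0, 8.0, 9.0]]
print(masked_average(params, masks, previous=[0.0, 0.0, 0.0, 0.0]))
```

Only parameter values are exchanged in this sketch, never activations, which matches the property the summary highlights.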

AI · Bearish · arXiv – CS AI · May 29 · 7/10

Does Distributed Training Undermine Compute Governance?

A research paper examines how distributed training algorithms could enable frontier AI model development outside traditional large datacenters, potentially circumventing compute governance regulations designed to monitor AI development. The authors propose countermeasures including chip tracking, whistleblowing programs, and forensic accounting to prevent regulatory evasion.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Mitigating Staleness in Asynchronous Pipeline Parallelism via Basis Rotation

Researchers propose a basis rotation framework to address gradient staleness in asynchronous pipeline parallelism, a technique used for distributed AI training. By aligning the optimizer's coordinate system with the Hessian eigenbasis, the method reduces training iterations by 81.7% compared to existing asynchronous baselines, enabling more efficient large-scale model training.

AI · Bullish · Hugging Face Blog · May 27 · 7/10

Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL

Hugging Face's TRL library introduces Delta Weight Sync, a novel technique enabling efficient distribution of trillion-parameter models across distributed systems using hub bucket storage. This innovation addresses a critical bottleneck in large-scale AI model training and deployment by reducing synchronization overhead.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

ForgeVLA: Federated Vision-Language-Action Learning without Language Annotations

ForgeVLA introduces a federated learning framework that enables Vision-Language-Action models to train on distributed robot data without centralizing sensitive information or requiring manual language annotations. The system uses embodied instruction classifiers to automatically generate missing language labels and addresses vision-language feature collapse through contrastive learning and adaptive aggregation.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

Researchers introduce CCL-D, a diagnostic system for detecting anomalies in large-scale AI model training that identifies GPU communication failures in under 6 minutes. Deployed across 4,000 GPUs over one year, the system addresses a critical bottleneck in distributed training where slow/hang anomalies typically require days to diagnose.

AI · Bearish · arXiv – CS AI · Apr 20 · 7/10

Power to the Clients: Federated Learning in a Dictatorship Setting

Researchers identify a critical vulnerability in federated learning systems where malicious 'dictator clients' can erase other participants' contributions while preserving their own, compromising the collaborative training process. The study provides theoretical and empirical analysis of single and multiple dictator scenarios, revealing fundamental security weaknesses in decentralized machine learning architectures.

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

Researchers developed MegaScale-Data, an industrial-grade distributed data loading architecture that significantly improves training efficiency for large foundation models using multiple data sources. The system achieves up to 4.5x training throughput improvement and 13.5x reduction in CPU memory usage through disaggregated preprocessing and centralized data orchestration.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10

veScale-FSDP: Flexible and High-Performance FSDP at Scale

Researchers introduce veScale-FSDP, a redesigned Fully Sharded Data Parallel system that overcomes limitations of current FSDP implementations used for training large-scale AI models. The new system features flexible RaggedShard format and structure-aware planning, achieving 5-66% higher throughput and 16-30% lower memory usage while supporting advanced training methods and scaling to tens of thousands of GPUs.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Subspace-Constrained Federated Learning with Low-Rank Adaptation

Researchers propose a subspace-regularized federated learning approach for low-rank adaptation (LoRA) that addresses geometric misalignment issues when training large language models across distributed clients with heterogeneous data. The method achieves superior performance on RoBERTa-large while demonstrating near-perfect basis overlap (0.9999) across multiple models and random seeds, outperforming existing federated learning baselines.

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

Researchers introduce LoRDO, a distributed optimization framework that combines low-rank techniques with infrequent communication to reduce bandwidth requirements in foundation model training by approximately 10x. The method addresses a critical bottleneck in distributed training by enabling workers to perform effective low-rank projections without full-batch gradient access, achieving near-parity performance with standard distributed training at model scales of 125M-720M parameters.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

QSplitFL: Capability Aware Deep Q-Learning for Optimal Split Point Selection in Split Federated Learning

QSplitFL introduces a Deep Q-Network framework that optimizes split point selection in federated learning by considering device heterogeneity, using lightweight hardware metrics instead of model weights. The approach demonstrates improved convergence and accuracy across multiple datasets and neural network architectures while adapting to varying client capabilities.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Multi-Level Analyzation of Imbalance to Resolve Non-IID-Ness in Federated Learning

Researchers propose FedBB, a federated learning framework that addresses class imbalance across three levels—within classes, between classes, and across distributed clients—using a specialized loss function and client reweighting strategy. The approach improves model performance on non-IID data while minimizing privacy risks through limited statistical information requirements.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

From Data Heterogeneity to Convergence: A Data-Centric Review of Federated Learning

A comprehensive survey analyzes federated learning through a data-centric lens, examining how non-IID data heterogeneity, experimental splitting protocols, and adversarial vulnerabilities affect model convergence and stability. The research ranks data properties by their convergence impact and provides actionable guidance for practitioners designing FL systems with predictable performance.

AI · Neutral · arXiv – CS AI · Jun 9 · 5/10

HASA: Subnet Allocation for Compute-Constrained Model-Heterogeneous Federated Learning

Researchers propose HASA, a subnet allocation algorithm for federated learning that assigns model sizes to edge devices based on data heterogeneity rather than just compute constraints. The method improves prediction accuracy across distributed clients while maintaining fixed computational budgets, with implications for efficient on-device AI deployment.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

Model Multiplicity for Adversarial Detection in Small Language Model Training on Edge Devices

Researchers propose a novel defense mechanism called model multiplicity to detect poisoning attacks in distributed small language model training on edge devices. Instead of maintaining a single global model, the system trains multiple independent models on different device subsets, using divergence between them to identify adversarial behavior—outperforming traditional single-model defenses.
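The idea of using divergence between independently trained models as a poisoning signal can be sketched as follows. This is a toy version under stated assumptions, not the paper's detector: the models are represented only by their predicted labels on a shared probe set, and the disagreement metric, the threshold, and the names `disagreement` and `flag_divergent` are hypothetical.

```python
def disagreement(preds_a, preds_b):
    """Fraction of probe examples on which two models predict different labels."""
    differing = sum(a != b for a, b in zip(preds_a, preds_b))
    return differing / len(preds_a)


def flag_divergent(predictions, threshold):
    """Return names of models whose mean disagreement with the rest exceeds threshold."""
    flagged = []
    for name, preds in predictions.items():
        others = [p for other, p in predictions.items() if other != name]
        mean_dis = sum(disagreement(preds, p) for p in others) / len(others)
        if mean_dis > threshold:
            flagged.append(name)
    return flagged


# Hypothetical probe-set predictions from three models, each trained on a
# different subset of devices. Model "C" disagrees with the other two.
probe_predictions = {
    "A": [0, 1, 1, 0],
    "B": [0, 1, 1, 0],
    "C": [1, 0, 0, 0],
}
print(flag_divergent(probe_predictions, threshold=0.5))
```

A flagged model points at the device subset it was trained on, which narrows down where poisoned updates may have originated.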

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Amortizing Federated Adaptation: Hypernetwork Driven LoRA for Personalized Foundation Models

Researchers introduce HyperLoRA, a federated learning framework that addresses limitations in distributed fine-tuning of foundation models. It uses hypernetworks to generate personalized LoRA parameters and learned aggregation in product space, achieving faster convergence and better personalization across heterogeneous client distributions.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

AgentJet: A Flexible Swarm Training Framework for Agentic Reinforcement Learning

AgentJet is a decoupled distributed framework for training LLM-based reinforcement learning agents across multiple nodes, enabling heterogeneous multi-agent teams and fault-tolerant execution. The system achieves 1.5-10x training speedup through context tracking optimization and automates long-horizon RL research workflows without human intervention.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Boosting Multimodal Federated Learning via Chained Modality Optimization

Researchers propose FedMChain, a federated learning framework that addresses modality competition in multimodal machine learning by structuring training as sequential modality-specific phases rather than joint optimization. The approach combines phase-wise local optimization with sparse sign-guided server aggregation to improve model performance while reducing communication overhead.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

Researchers propose Straggler-Aware Group Control (SAGC), a dynamic optimization technique that improves the efficiency of synchronous reinforcement learning by adapting group sizes based on observed training behavior. The method addresses a critical bottleneck in on-policy RL where slow individual rollouts delay entire group computations, achieving better wall-clock performance while maintaining or improving model quality on reasoning benchmarks.
