Robust Synchronisation for Federated Learning in the Face of Correlated Device Failure
Researchers introduce Availability-Weighted Probabilistic Synchronous Parallel (AW-PSP), an improved federated learning algorithm that addresses bias in node sampling when device availability and data distribution are correlated. The technique uses dynamic probability adjustments, Markov-based failure prediction, and distributed metadata management to improve fairness and robustness in edge computing environments where devices frequently fail or become unavailable.
Federated learning systems face a fundamental challenge when deploying machine learning across unreliable edge devices: devices with high availability naturally dominate training while frequently unavailable devices contribute minimally, potentially skewing learned models toward overrepresented data distributions. This research tackles a subtle but consequential problem in distributed AI systems where the correlation between device reliability and data characteristics creates systematic bias.
The technical landscape of federated learning has evolved to handle device heterogeneity, but most approaches treat availability as a random, independent phenomenon. Real-world deployments reveal structure: certain geographic regions may have poor connectivity, specific device types fail more frequently, and user activity follows temporal patterns. When these availability patterns correlate with demographic or categorical data characteristics, standard sampling methods perpetuate representation gaps. AW-PSP distinguishes between transient failures (temporary connectivity issues) and chronic failures (persistent unavailability) using Markov chain predictions, enabling more intelligent node selection.
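The transient-versus-chronic distinction can be illustrated with a simple per-device two-state Markov model. This is a minimal sketch, not the paper's actual estimator: the state space, the `chronic_threshold` parameter, and the `FailurePredictor` class are all illustrative assumptions.

```python
from collections import defaultdict

class FailurePredictor:
    """Two-state (up/down) Markov model per device.

    Hypothetical sketch: estimates P(down -> down) from observed
    availability history; a device that is currently down and has a
    high self-transition probability in the down state is flagged as
    chronically failed. `chronic_threshold` is illustrative.
    """

    def __init__(self, chronic_threshold=0.7):
        self.chronic_threshold = chronic_threshold
        # counts[device][(prev_state, next_state)] -> observation count
        self.counts = defaultdict(lambda: defaultdict(int))
        self.last_state = {}

    def observe(self, device, available):
        """Record one availability heartbeat for a device."""
        state = "up" if available else "down"
        prev = self.last_state.get(device)
        if prev is not None:
            self.counts[device][(prev, state)] += 1
        self.last_state[device] = state

    def p_stay_down(self, device):
        """Estimated P(down -> down); high values suggest chronic failure."""
        c = self.counts[device]
        down_total = c[("down", "down")] + c[("down", "up")]
        if down_total == 0:
            return 0.0
        return c[("down", "down")] / down_total

    def classify(self, device):
        """Label a device as healthy, transient, or chronic."""
        if self.last_state.get(device) == "up":
            return "healthy"
        return ("chronic"
                if self.p_stay_down(device) >= self.chronic_threshold
                else "transient")
```

A device that flaps between up and down will show a low down-to-down probability and be treated as a transient failure, while one that stays down across many heartbeats crosses the threshold and is deprioritized as chronic.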
For organizations deploying federated learning at scale, this work directly impacts model quality and fairness outcomes. Companies developing edge AI applications, particularly in healthcare, finance, or cross-device machine learning, face pressure to ensure models generalize fairly across all participant populations. Poor fairness in federated learning can lead to models that underperform for specific groups, creating regulatory and reputational risks. The distributed hash table approach for decentralized metadata management also reduces coordination overhead, making the solution practical for large-scale deployments. Looking forward, as federated learning becomes more prevalent in production systems, availability-aware sampling will likely become standard practice rather than an optimization.
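The decentralized metadata layer can be pictured as a consistent-hashing ring that assigns each device's availability metadata to a responsible node without a central coordinator. The paper describes its DHT layer only at a high level; the ring construction below (class name, virtual-node count) is one common way to realize such a layer, offered as an assumption-laden sketch.

```python
import hashlib
from bisect import bisect_right

class MetadataRing:
    """Consistent-hashing ring for decentralised metadata placement.

    Illustrative sketch: each participating node is mapped onto the
    ring at several virtual positions; a device's metadata is stored
    on the first node clockwise from the hash of its identifier, so
    no single coordinator tracks all devices.
    """

    def __init__(self, nodes, vnodes=64):
        self.ring = []  # sorted list of (hash, node)
        for node in nodes:
            for v in range(vnodes):
                self.ring.append((self._hash(f"{node}#{v}"), node))
        self.ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def owner(self, device_id):
        """Node responsible for storing a device's availability metadata."""
        h = self._hash(device_id)
        idx = bisect_right(self.ring, (h, "")) % len(self.ring)
        return self.ring[idx][1]
```

Because placement is a pure function of the hashed identifier, any node can locate a device's metadata locally, and adding or removing a node remaps only the keys adjacent to it on the ring.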
- AW-PSP dynamically adjusts node sampling probabilities based on real-time availability predictions and failure correlation metrics to address bias in federated learning
- The algorithm distinguishes between transient and chronic device failures using Markov-based prediction, enabling smarter participation management
- Evaluation shows improved label coverage and reduced fairness variance compared to standard PSP, especially under correlated failure scenarios
- Distributed Hash Table layer decentralizes metadata management, enabling scalability to large node counts without central coordination bottlenecks
- The approach directly addresses a production deployment challenge where device reliability and data distribution correlations cause systematic model bias
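The availability-weighted sampling idea from the first takeaway can be sketched as an inverse-availability weighting rule. This is not the paper's exact adjustment: the `alpha` tempering parameter and the function names are assumptions made for illustration.

```python
import random

def sampling_weights(availability, alpha=1.0, chronic=frozenset()):
    """Availability-weighted sampling sketch (not AW-PSP's exact rule).

    Devices that are rarely available get proportionally larger
    weights (~ 1 / availability**alpha) so their data is not
    underrepresented in expectation; devices predicted chronic are
    excluded entirely. `alpha` tempers how aggressively rare devices
    are boosted.
    """
    weights = {}
    for dev, avail in availability.items():
        if dev in chronic or avail <= 0.0:
            continue
        weights[dev] = (1.0 / avail) ** alpha
    total = sum(weights.values())
    return {dev: w / total for dev, w in weights.items()}

def sample_round(availability, k, rng=random, **kw):
    """Draw k participants for one training round with adjusted weights."""
    probs = sampling_weights(availability, **kw)
    devices = list(probs)
    return rng.choices(devices, weights=[probs[d] for d in devices], k=k)
```

For example, a device that is online 30% of the time receives three times the sampling weight of one online 90% of the time, counteracting the overrepresentation of highly available devices that the article describes.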