From Data Heterogeneity to Convergence: A Data-Centric Review of Federated Learning
A comprehensive survey analyzes federated learning through a data-centric lens, examining how non-IID data heterogeneity, experimental splitting protocols, and adversarial vulnerabilities affect model convergence and stability. The research ranks data properties by their convergence impact and provides actionable guidance for practitioners designing FL systems with predictable performance.
Federated learning addresses a critical challenge in modern machine learning: enabling collaborative model training across distributed clients while preserving data privacy. This survey shifts focus from general FL foundations to the specific mechanisms through which data characteristics govern training outcomes. The authors systematically categorize non-IID data heterogeneity into measurable traits, rank each trait's influence on convergence as strong, medium, or light, and explain the underlying mechanisms across domains including images, text, and graphs.
The research emerges from a recognized gap in existing FL literature. Previous surveys cover security, applications, and general challenges but lack granular analysis connecting data properties directly to convergence behavior. This work bridges that gap by examining experimental splitting practices used in FL research, exposing artifacts these methodologies introduce, and demonstrating their performance implications.
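To make the idea of an experimental splitting protocol concrete, here is a minimal sketch of one protocol commonly used in FL experiments: Dirichlet label-skew partitioning. This summary does not say which protocols the survey examines, so the choice of protocol is an assumption, and the function name `dirichlet_label_split` and its parameters (`num_clients`, `alpha`, `seed`) are hypothetical names chosen for illustration. The sketch uses only the Python standard library.

```python
import random
from collections import Counter


def dirichlet_label_split(labels, num_clients, alpha, seed=0):
    """Partition sample indices across clients with Dirichlet label skew.

    Illustrative sketch only (not taken from the survey). A smaller alpha
    concentrates each class on fewer clients, i.e. a more non-IID split.
    """
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # A Dirichlet(alpha) sample is a set of normalized Gamma(alpha, 1) draws.
        weights = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(weights)
        # Turn the proportions into cumulative cut points over this class.
        start, cum = 0, 0.0
        for c in range(num_clients):
            cum += weights[c] / total
            end = len(idxs) if c == num_clients - 1 else round(cum * len(idxs))
            clients[c].extend(idxs[start:end])
            start = end
    return clients


# Usage: 1,000 samples over 10 balanced classes, split across 5 clients.
labels = [i % 10 for i in range(1000)]
for alpha in (0.1, 100.0):
    parts = dirichlet_label_split(labels, num_clients=5, alpha=alpha)
    print(f"alpha={alpha}")
    for c, idxs in enumerate(parts):
        print("  client", c, len(idxs), dict(Counter(labels[i] for i in idxs)))
```

Comparing the per-client class counts for a small and a large `alpha` shows how a single protocol parameter changes both label skew and client dataset sizes at once. Side effects of this kind, built into the split rather than the learning algorithm, are the sort of artifact a reader should keep in mind when comparing FL results across papers.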
For practitioners and researchers developing federated systems, this survey provides concrete, predictive guidance rather than abstract principles. By explicitly mapping data-related vulnerabilities to convergence-robustness trade-offs, the work enables informed design decisions. Organizations deploying FL across healthcare, finance, or other privacy-sensitive domains can anticipate performance degradation from specific data conditions and implement defenses accordingly.
The impact extends beyond academic research. As federated learning adoption accelerates in production environments, understanding data-driven convergence dynamics becomes commercially relevant. The survey gives teams a basis for estimating training efficiency, resource requirements, and timelines when deploying FL systems with heterogeneous client data distributions.
- The impact of non-IID data heterogeneity on FL convergence varies significantly: some traits strongly degrade performance while others have minimal effect.
- Experimental data splitting protocols widely used in FL research introduce artifacts that measurably affect accuracy and convergence speed.
- Adversarial defenses against data-related vulnerabilities create explicit trade-offs between convergence speed and robustness that practitioners must balance.
- Data properties are primary determinants of FL system stability, making data-centric analysis essential for predictable training outcomes.
- The survey provides actionable guidance linking concrete data characteristics to convergence predictions across image, text, and graph data.