Benchmarking the Alignment of Data-Quality Metrics, Human Judgment and Land-Cover Segmentation Performance for Earth Observation
Researchers benchmarked data-quality metrics used to evaluate synthetic Earth observation images and found significant misalignment between automatic fidelity scores (FID, KID, IS, LPIPS, SSIM) and both human perception and downstream segmentation performance. Synthetic data flagged as low-quality by standard metrics actually improved model performance when combined with real data, suggesting current evaluation frameworks are inadequate for geospatial applications.
This research exposes a critical gap in how the machine learning community evaluates synthetic data quality, with direct implications for geospatial AI development. The study demonstrates that popular metrics like FID, which relies on features from an ImageNet-trained network, fail to capture task-specific utility and often penalize semantics-preserving transformations, such as image rotation, that human raters find imperceptible. This misalignment creates a practical problem: datasets rejected by automated quality gates may actually enhance downstream model performance, so researchers risk discarding valuable training data.
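The rotation effect can be illustrated with a toy example. The sketch below is not the study's experimental setup: it uses a simplified single-window SSIM (real SSIM is computed over sliding windows), a synthetic two-class scene, and hypothetical helper names (`global_ssim`, `class_histogram`) chosen only for illustration.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over the whole image (a simplification, assumed here)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / (
        (mx ** 2 + my ** 2 + c1) * (vx + vy + c2)
    )

def class_histogram(mask, num_classes):
    """Pixel count per land-cover class; independent of pixel positions."""
    return np.bincount(mask.ravel(), minlength=num_classes)

# Toy scene (an assumption): left half is class 0, right half is class 1.
mask = np.zeros((64, 64), dtype=np.int64)
mask[:, 32:] = 1
image = 0.1 + 0.8 * mask  # one intensity per class, in [0, 1]

# A 90-degree rotation keeps the land-cover content but moves every pixel.
rot_image = np.rot90(image)
rot_mask = np.rot90(mask)

ssim_same = global_ssim(image, image)
ssim_rot = global_ssim(image, rot_image)
hist_equal = np.array_equal(
    class_histogram(mask, 2), class_histogram(rot_mask, 2)
)

print(f"SSIM(image, image)   = {ssim_same:.4f}")
print(f"SSIM(image, rotated) = {ssim_rot:.4f}")
print(f"class histogram unchanged: {hist_equal}")
```

In this toy case the similarity score collapses toward zero under rotation, while the class composition of the scene, which is what a segmentation model ultimately learns from, is unchanged.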
The findings emerge from a broader trend of increased reliance on synthetic data augmentation to overcome dataset scarcity and acquisition costs. As deep generative models improve, the bottleneck has shifted from generation capability to evaluation methodology. Learned metrics such as FID, KID, IS and LPIPS measure visual fidelity in feature spaces optimized for ImageNet classification, not for domain-specific tasks like land-cover segmentation, where semantic consistency matters more than pixel-level realism.
For the geospatial and Earth observation community, this challenges established procurement and validation practices. Organizations building satellite-based AI systems may currently reject synthetic datasets that could meaningfully improve model robustness and reduce reliance on expensive labeled imagery. The industry impact extends beyond academia: companies and agencies investing in synthetic data pipelines need evaluation frameworks tied to business outcomes rather than benchmark scores.
Looking forward, the focus must shift toward task-grounded evaluation metrics and mandatory human validation protocols for specialized domains. Researchers should develop domain-specific quality benchmarks that correlate with segmentation performance rather than adopting generic visual similarity measures. This work signals that downstream task performance and human judgment should become standard requirements in synthetic data evaluation pipelines.
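One way to check whether a quality metric is task-grounded is to rank candidate synthetic datasets by the metric and by their downstream gain, then compare the two rankings. The sketch below computes a Spearman rank correlation with NumPy only; the function names are hypothetical and the numbers are made-up placeholders, not results from the study.

```python
import numpy as np

def ranks(values):
    """Rank values from 0 to n-1 (assumes no ties, for simplicity)."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(len(values))
    return r

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(ranks(a), ranks(b))[0, 1])

# Placeholder values for five hypothetical synthetic datasets.
# These are NOT measurements from the benchmark described above.
fid_scores = [12.0, 25.0, 40.0, 55.0, 70.0]   # lower is nominally "better"
miou_gain = [0.5, 2.1, 0.2, 1.8, 1.0]         # segmentation gain when added to real data

# Negate FID so that "higher is better" holds for both inputs.
rho = spearman([-f for f in fid_scores], miou_gain)
print(f"Spearman correlation between metric and downstream gain: {rho:.2f}")
```

A correlation near 1 would mean the metric ranks datasets the way the downstream task does; values near zero or below would indicate the kind of misalignment the study reports.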
- FID and similar metrics poorly predict synthetic data utility for Earth observation, often disagreeing with human perception and downstream task performance.
- Semantics-preserving perturbations like rotation significantly degrade automatic metric scores while remaining imperceptible to humans and preserving model utility.
- Synthetic samples scoring poorly on standard metrics improved semantic segmentation performance when combined with real training data.
- ImageNet-pretrained feature spaces are unreliable quality indicators for domain-specific geospatial applications.
- Evaluation of synthetic datasets should prioritize downstream task performance and human judgment over automatic fidelity metrics.