Epidemiology of Model Collapse: Modeling Synthetic Data Contamination via Bilayer SIR Dynamics
Researchers propose a bilayer SIR epidemic model to analyze how synthetic data contamination spreads across AI systems when models train on each other's outputs. Through theoretical analysis, simulations, and GPT-2 experiments, they demonstrate that cross-contamination can sustain itself (Rβ > 1) and identify detection-based filtering as the most effective intervention strategy.
This research addresses a critical vulnerability in the AI ecosystem: model collapse through synthetic data contamination. Unlike previous analyses that treat collapse as degradation within a single, isolated model, the study starts from the observation that AI systems share an interconnected environment: synthetic outputs from one model become training data for others, creating feedback loops that amplify quality degradation. The bilayer framework models data corpora and AI models as two coupled epidemic populations and borrows epidemiological tools to quantify when contamination becomes self-sustaining.
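To make the "coupled epidemic populations" idea concrete, here is a minimal sketch of a two-layer SIR system. It is not the paper's model: the equations, the parameter names (`beta_cm`, `beta_mc`, `gamma_c`, `gamma_m`), and the choice to allow only cross-layer transmission are assumptions made for illustration. Corpora become contaminated by ingesting output from contaminated models, models become contaminated by training on contaminated corpora, and each layer recovers at its own rate.

```python
def simulate_bilayer_sir(beta_cm, beta_mc, gamma_c, gamma_m,
                         seed_fraction=0.01, dt=0.01, steps=20_000):
    """Forward-Euler integration of a toy two-layer SIR system.

    All names are hypothetical, chosen for this sketch:
      beta_cm  rate at which contaminated models contaminate corpora
      beta_mc  rate at which contaminated corpora contaminate models
      gamma_c  corpus recovery rate (e.g. filtering)
      gamma_m  model recovery rate (e.g. retraining on clean data)
    Each layer is tracked as fractions (S, I, R) that sum to 1.
    """
    s_c, i_c, r_c = 1.0 - seed_fraction, seed_fraction, 0.0  # corpora layer
    s_m, i_m, r_m = 1.0, 0.0, 0.0                            # model layer
    peak_c, peak_m = i_c, i_m

    for _ in range(steps):
        # Cross-layer transmission: each layer is infected by the other one.
        new_c = beta_cm * s_c * i_m * dt
        new_m = beta_mc * s_m * i_c * dt
        # Within-layer recovery.
        rec_c = gamma_c * i_c * dt
        rec_m = gamma_m * i_m * dt

        s_c, i_c, r_c = s_c - new_c, i_c + new_c - rec_c, r_c + rec_c
        s_m, i_m, r_m = s_m - new_m, i_m + new_m - rec_m, r_m + rec_m

        peak_c = max(peak_c, i_c)
        peak_m = max(peak_m, i_m)

    return {
        "corpora": (s_c, i_c, r_c),
        "models": (s_m, i_m, r_m),
        "peak_corpora": peak_c,
        "peak_models": peak_m,
    }


if __name__ == "__main__":
    # Illustrative parameter values only; they are not calibrated to any data.
    fast = simulate_bilayer_sir(beta_cm=0.6, beta_mc=0.5, gamma_c=0.2, gamma_m=0.2)
    slow = simulate_bilayer_sir(beta_cm=0.1, beta_mc=0.1, gamma_c=0.2, gamma_m=0.2)
    print("peak contaminated model fraction (fast spread):", fast["peak_models"])
    print("peak contaminated model fraction (slow spread):", slow["peak_models"])
```

In this toy version a small contaminated seed in the corpus layer either dies out or grows into a large outbreak in both layers, depending on how the product of the two transmission rates compares with the product of the two recovery rates.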
The findings are grounded in both theory and empirical validation. The derived basic reproduction number (Rβ) incorporates parameters for cross-layer transmission rates and recovery mechanisms, with calibration against real AI text prevalence data showing supercritical dynamics across multiple scenarios. Sensitivity analysis reveals that synthetic-text detection effectiveness is the highest-leverage parameter, suggesting that technological countermeasures matter more than model retraining alone. The matched-budget experiments across 1,088 runs provide nuanced evidence that source diversity offers modest but diminishing protection.
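As a rough illustration of how such a threshold can depend on detection, the sketch below computes a reproduction number for a simplified two-layer model with purely cross-layer transmission. The closed form is the spectral radius of that toy model's next-generation matrix, not the expression derived in the paper, and every parameter name is hypothetical. Detection is modeled, as an assumption, as a fraction `detection` of synthetic text that is filtered out before it can enter a corpus.

```python
import math


def reproduction_number(beta_cm, beta_mc, gamma_c, gamma_m, detection=0.0):
    """Threshold quantity for a toy two-layer model (assumed, not the paper's).

    Linearizing around the contamination-free state gives the
    next-generation matrix
        K = [[0,               beta_cm / gamma_m],
             [beta_mc / gamma_c, 0              ]]
    whose spectral radius is sqrt(beta_cm * beta_mc / (gamma_c * gamma_m)).
    Detection is assumed to scale model-to-corpus transmission by (1 - detection).
    """
    if not 0.0 <= detection <= 1.0:
        raise ValueError("detection must lie in [0, 1]")
    effective_beta_cm = (1.0 - detection) * beta_cm
    return math.sqrt(effective_beta_cm * beta_mc / (gamma_c * gamma_m))


def critical_detection(beta_cm, beta_mc, gamma_c, gamma_m):
    """Smallest detection fraction that brings the toy threshold down to 1."""
    unfiltered_sq = beta_cm * beta_mc / (gamma_c * gamma_m)
    return max(0.0, 1.0 - 1.0 / unfiltered_sq)


if __name__ == "__main__":
    # Illustrative values only; they are not calibrated to any data.
    params = dict(beta_cm=0.6, beta_mc=0.5, gamma_c=0.2, gamma_m=0.2)
    print("threshold without filtering:", reproduction_number(**params))
    print("detection needed to reach 1:", critical_detection(**params))
```

Because the toy threshold is a geometric mean over a full corpus-to-model-to-corpus cycle, it exceeds 1 exactly when the product of the transmission rates exceeds the product of the recovery rates. This sketch says nothing about which parameter has the most leverage in the paper's calibrated model.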
For the AI industry, this work has significant implications for data governance and model development strategies. Companies relying on open-source training data face increasing contamination risk as synthetic content proliferates. The research suggests that detection and filtering are more cost-effective interventions than architectural changes. However, the agent-based model degrades under heterogeneous network conditions, which hints that real-world contamination dynamics may be more complex than mean-field predictions and warrants continued investigation into how diverse model sizes and training methodologies affect ecosystem resilience.
- Synthetic data contamination in AI systems exhibits epidemic-like dynamics with a reproduction number Rβ > 1, indicating self-sustaining spread across model ecosystems.
- Detection-based filtering of synthetic content emerges as the highest-impact intervention, more effective than source diversity or model retraining strategies.
- The bilayer SIR/SIRS framework successfully predicts contamination thresholds and qualitatively matches experimental observations across GPT-2 degradation chains.
- Real-world contamination dynamics may be more complex than mean-field models predict, particularly in heterogeneous networks with diverse model types and scales.
- Immunity waning in filtered corpora means that one-time cleanup efforts provide only temporary protection against re-contamination in open ecosystems.
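
The last point, on waning immunity, can be illustrated with a toy two-layer SIRS simulation. This is a sketch under assumptions, not the paper's model: the equations, the waning rates `omega_c` and `omega_m`, and all parameter values are hypothetical. Recovered corpora and models return to the susceptible pool at the waning rate, so contamination can settle at a persistent level instead of burning out.

```python
def simulate_bilayer_sirs(beta_cm, beta_mc, gamma_c, gamma_m,
                          omega_c, omega_m,
                          seed_fraction=0.01, dt=0.01, steps=50_000):
    """Forward-Euler integration of a toy two-layer SIRS system.

    Hypothetical parameters for this sketch:
      beta_cm, beta_mc  cross-layer transmission rates
      gamma_c, gamma_m  recovery rates (filtering, retraining)
      omega_c, omega_m  waning rates (recovered -> susceptible again)
    Returns the contaminated fractions (corpora, models) at the final step.
    """
    s_c, i_c, r_c = 1.0 - seed_fraction, seed_fraction, 0.0
    s_m, i_m, r_m = 1.0, 0.0, 0.0

    for _ in range(steps):
        new_c = beta_cm * s_c * i_m * dt   # corpora contaminated by models
        new_m = beta_mc * s_m * i_c * dt   # models contaminated by corpora
        rec_c = gamma_c * i_c * dt
        rec_m = gamma_m * i_m * dt
        wane_c = omega_c * r_c * dt        # cleaned corpora lose protection
        wane_m = omega_m * r_m * dt        # retrained models lose protection

        s_c, i_c, r_c = s_c - new_c + wane_c, i_c + new_c - rec_c, r_c + rec_c - wane_c
        s_m, i_m, r_m = s_m - new_m + wane_m, i_m + new_m - rec_m, r_m + rec_m - wane_m

    return i_c, i_m


if __name__ == "__main__":
    # Illustrative values only; they are not calibrated to any data.
    rates = dict(beta_cm=0.6, beta_mc=0.5, gamma_c=0.2, gamma_m=0.2)
    print("with waning:   ", simulate_bilayer_sirs(**rates, omega_c=0.05, omega_m=0.05))
    print("without waning:", simulate_bilayer_sirs(**rates, omega_c=0.0, omega_m=0.0))
```

With the waning rates set to zero the sketch reduces to a plain SIR system, where a single outbreak exhausts the susceptible pool and contamination fades. With positive waning rates and supercritical transmission, the contaminated fractions in both layers stay well above zero in the long run.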