When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification
Researchers demonstrate that synthetic data generated by LLMs for patent classification shows mixed results, with improvements primarily driven by increased sample volume rather than data quality. The optimal strategy combines 20-30% real data with 70-80% synthetic data, though synthetic corpora can paradoxically harm retrieval performance despite improving classification metrics.
This research addresses a critical question in machine learning: when does synthetically generated data actually improve model performance versus merely inflating metrics through volume effects? The study evaluates six open-source LLMs on WIPO patent classification for assistive technologies and finds that the headline improvements largely disappear under a volume-matched control. When the researchers reproduced the sample size of their best result using simple random sampling with replacement on real data, the improvement collapsed from +0.582 to +0.024, demonstrating that the claimed gains predominantly reflect sample size benefits rather than synthetic data quality.
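The volume-matched control described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the function name `resample_with_replacement` and its parameters are hypothetical, and the only assumption is that the control set is drawn uniformly with replacement from the real examples until it matches the size of the augmented training set.

```python
import random


def resample_with_replacement(real_examples, target_size, seed=0):
    """Draw `target_size` items uniformly, with replacement, from the real data.

    Hypothetical helper: the result serves as a volume-matched control, so a
    model trained on it sees as many examples as one trained on real plus
    synthetic data, but no new content.
    """
    if not real_examples:
        raise ValueError("real_examples must be non-empty")
    rng = random.Random(seed)  # local RNG keeps the draw reproducible
    return rng.choices(real_examples, k=target_size)


# Illustrative usage: inflate three real records to the size of an augmented set.
control_set = resample_with_replacement(["doc_a", "doc_b", "doc_c"], target_size=30)
```

Comparing a model trained on the augmented set against one trained on `control_set`, rather than against the small original set, separates what the synthetic text contributes from what sheer sample count contributes.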
The research identifies a striking phenomenon: the relationship between data fidelity and classification performance inverts as real data availability increases. In low-resource scenarios, maximum mean discrepancy (MMD) scores correlate strongly with performance (r = +0.95), but this relationship flips at the 1:10 real-to-synthetic ratio (r = -0.73). This inversion suggests that when real data is scarce, volume dominates; when real data is abundant, synthetic artifacts become liabilities.
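For readers unfamiliar with MMD, the sketch below computes a biased estimate of squared MMD between two sets of vectors using an RBF kernel. The kernel choice, the `gamma` value, and the idea of applying it to document embeddings are assumptions for illustration; the summary does not say which kernel or representation the study used. Lower values mean the two samples look more alike.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased (V-statistic) estimate of squared MMD between samples xs and ys."""

    def mean_kernel(a, b):
        # Average kernel value over all pairs drawn from a and b.
        total = sum(rbf_kernel(p, q, gamma) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Illustrative usage with toy 2-D "embeddings" (made-up values).
real_vecs = [(0.0, 0.0), (1.0, 0.0)]
synthetic_vecs = [(5.0, 5.0), (6.0, 5.0)]
gap = mmd_squared(real_vecs, synthetic_vecs)
```

With one MMD value per synthetic corpus, the reported correlations would come from pairing those values with each corpus's classification score at a fixed real-to-synthetic ratio.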
For practitioners implementing synthetic data strategies, the findings recommend a balanced approach: a training set built from 20-30% real data and 70-80% synthetic data outperforms both purely real and purely synthetic alternatives. However, the research reveals a critical caveat: synthetic corpora that boosted classification F1 scores by up to 0.58 simultaneously degraded Jaccard-overlap retrieval metrics and reduced normalized discounted cumulative gain (nDCG@10) by 26%. This trade-off indicates that optimizing synthetic data for one task may introduce distribution shifts that harm downstream applications, a consideration often overlooked in ML pipelines that assume performance metrics translate across use cases.
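The two retrieval measures named above can be written compactly. The following is a minimal sketch under stated assumptions: Jaccard overlap is taken between two label sets, nDCG uses linear gain, and the ideal ranking is derived from the same list of relevance grades. The study's exact formulation may differ, and the function names are illustrative.

```python
import math


def jaccard(a, b):
    """Jaccard overlap between two collections, treated as sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets count as identical
    return len(a & b) / len(a | b)


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevances (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the same grades sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Illustrative usage with made-up classification codes and relevance grades.
overlap = jaccard({"A61", "G06"}, {"G06", "H04"})
score = ndcg_at_k([1, 3, 0, 2])
```

Because nDCG rewards placing the most relevant documents first, a corpus can raise classification F1 while still pushing relevant patents down the ranking, which is the divergence the study reports.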
- Synthetic data improvements in patent classification primarily reflect volume effects rather than data quality gains.
- Optimal resource allocation combines 20-30% real data with 70-80% synthetic data, outperforming purely real or synthetic approaches.
- The relationship between data fidelity and model performance inverts as real data availability increases.
- Synthetic corpora can simultaneously improve classification metrics while degrading retrieval performance by 26%.
- Current LLM-generated patent data exhibits prompt-family variations that harm downstream information retrieval tasks.