AI · Neutral · arXiv – CS AI · 15h ago · 6/10
When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification
Researchers find that LLM-generated synthetic data gives mixed results for patent classification: the gains come mainly from increased sample volume rather than from data quality. The best-performing mix combines 20-30% real data with 70-80% synthetic data, though synthetic corpora can harm retrieval performance even while improving classification metrics.