AI · Bearish · arXiv – CS AI · 15h ago · 7/10
Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications
A comprehensive survey examines Pretraining Data Exposure (PDE) in large language models, unifying two previously isolated research areas, membership inference and data contamination, around a shared question: whether specific data appeared in an LLM's training set. The work formalizes exposure levels, reviews attack and defense mechanisms, and highlights risks to privacy and evaluation integrity as model sizes and training data scales continue to grow.