AI · Neutral · arXiv – CS AI · 18h ago · 6/10
MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models
Researchers introduce MC-PDD, a black-box method to detect whether specific datasets were used to pretrain large language models by analyzing prediction patterns on masked text. The technique works through standard API access without requiring model probability distributions, enabling practical auditing of closed-source LLMs and addressing transparency concerns around proprietary training data.
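The summary gives no algorithmic detail, so the following is only a minimal sketch of the general idea of masked, corpus-level probing through a text-only interface. It is not the paper's procedure. It assumes that a model recovers hidden words more often in text it has seen than in comparable text it has not. Every name here (`mask_words`, `recovery_rate`, `corpus_gap`, the `[MASK]` placeholder, the 15% masking rate) is hypothetical, and `fill_masks` stands in for whatever wrapper would send a masked passage to an LLM API and parse its guesses.

```python
import random
from typing import Callable, Sequence

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


def mask_words(text: str, rate: float, rng: random.Random) -> tuple[str, list[str]]:
    """Hide a random subset of words; return the masked text and the hidden words in order."""
    words = text.split()
    if not words:
        return text, []
    k = min(len(words), max(1, round(len(words) * rate)))
    positions = sorted(rng.sample(range(len(words)), k))
    hidden = [words[i] for i in positions]
    for i in positions:
        words[i] = MASK
    return " ".join(words), hidden


def recovery_rate(
    texts: Sequence[str],
    fill_masks: Callable[[str], list[str]],
    rate: float = 0.15,
    seed: int = 0,
) -> float:
    """Fraction of hidden words the model guesses exactly, pooled over a whole corpus."""
    rng = random.Random(seed)
    hits = total = 0
    for text in texts:
        masked, hidden = mask_words(text, rate, rng)
        guesses = fill_masks(masked)  # text in, text out: no probabilities needed
        hits += sum(g == h for g, h in zip(guesses, hidden))
        total += len(hidden)
    return hits / total if total else 0.0


def corpus_gap(
    candidate: Sequence[str],
    reference: Sequence[str],
    fill_masks: Callable[[str], list[str]],
    **kwargs,
) -> float:
    """Recovery rate on the candidate corpus minus the rate on a reference corpus assumed unseen."""
    return recovery_rate(candidate, fill_masks, **kwargs) - recovery_rate(
        reference, fill_masks, **kwargs
    )
```

Under these assumptions, a clearly positive gap would hint that the candidate corpus was part of the training data. A real audit would also need a reference corpus matched in domain and difficulty, plus a significance test, neither of which is covered here.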