MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models
Researchers introduce MC-PDD, a black-box method to detect whether specific datasets were used to pretrain large language models by analyzing prediction patterns on masked text. The technique works through standard API access without requiring model probability distributions, enabling practical auditing of closed-source LLMs and addressing transparency concerns around proprietary training data.
MC-PDD addresses a fundamental opacity problem in large language model development. As LLMs become increasingly central to AI applications, the lack of visibility into pretraining datasets raises critical questions about copyright, data provenance, and potential bias propagation. The research tackles this by introducing a detection method that operates under strict black-box constraints, a realistic scenario for most commercial LLMs, which expose only API interfaces.
The approach leverages masked language modeling, a technique familiar from BERT-style pretraining, to identify statistically significant prediction differences between texts that were likely included in training and those that were not. By masking specific tokens and measuring hit rates (how often the model's prediction matches the original token), researchers can infer pretraining inclusion without accessing underlying model weights or probability distributions. This represents a methodological advance because existing state-of-the-art detection methods typically require access to model internals or output probabilities, making them impractical for auditing proprietary systems.
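The masking-and-hit-rate idea can be illustrated with a minimal sketch. This is not the paper's implementation: the `Predictor` interface, the whitespace tokenization, the 15% mask rate, and the permutation test are all assumptions chosen for illustration, since the summary does not specify how MC-PDD selects masks or which statistical test it uses.

```python
import random
from typing import Callable, List, Sequence, Tuple

MASK = "[MASK]"

# Hypothetical black-box interface (an assumption, not a real API): it takes a
# token list containing MASK placeholders and returns one guessed token per
# masked position, in order. In practice this would wrap a text-only API call.
Predictor = Callable[[List[str]], List[str]]


def mask_text(tokens: Sequence[str], mask_rate: float,
              rng: random.Random) -> Tuple[List[str], List[str]]:
    """Mask a random subset of tokens; return the masked tokens and the answers."""
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    answers = []
    for pos in positions:
        answers.append(masked[pos])
        masked[pos] = MASK
    return masked, answers


def corpus_hit_rates(texts: Sequence[str], predict: Predictor,
                     mask_rate: float = 0.15, seed: int = 0) -> List[float]:
    """Per-text fraction of masked tokens the model restores exactly.

    Assumes every text is non-empty; tokenization is a plain whitespace split.
    """
    rng = random.Random(seed)
    rates = []
    for text in texts:
        masked, answers = mask_text(text.split(), mask_rate, rng)
        guesses = predict(masked)
        hits = sum(g == a for g, a in zip(guesses, answers))
        rates.append(hits / len(answers))
    return rates


def permutation_p_value(candidate: Sequence[float], reference: Sequence[float],
                        n_perm: int = 2000, seed: int = 0) -> float:
    """One-sided permutation test: is the candidate mean hit rate higher?"""
    rng = random.Random(seed)
    k = len(candidate)
    observed = sum(candidate) / k - sum(reference) / len(reference)
    pooled = list(candidate) + list(reference)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k)
        if diff >= observed:
            count += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (count + 1) / (n_perm + 1)
```

In this sketch, an auditor would compute `corpus_hit_rates` for the candidate dataset and for a reference set known to postdate the model's training, then pass both lists to `permutation_p_value`. A small p-value would suggest the model restores masked tokens in the candidate corpus more often than chance alone explains, though it would not by itself prove inclusion.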
For the AI industry, this work carries meaningful implications. Developers and organizations relying on closed-source models gain a practical tool for due diligence around training data sources. Content creators and data owners gain potential mechanisms for copyright verification and protection. The ability to audit pretraining data through standard API access democratizes model transparency efforts, shifting leverage away from model developers who previously controlled all information about training procedures.
The research points toward more standardized LLM auditing practices. As regulatory frameworks evolve around AI transparency and data rights, detection methodologies like MC-PDD may become foundational infrastructure for compliance verification. Future work will likely focus on refining detection accuracy, understanding adversarial robustness, and scaling these methods across diverse model architectures.
- MC-PDD enables detection of pretraining data inclusion using only black-box API access, eliminating the need for model internals or probability distributions.
- The method achieves performance comparable to existing detection approaches while operating under stricter constraints, validating its practical applicability.
- Black-box pretraining data detection tools support copyright verification, model auditing, and transparency efforts for closed-source LLMs.
- Masked token prediction analysis reveals statistically significant differences between data seen during pretraining and unseen data across multiple datasets and model types.
- This work establishes practical infrastructure for LLM auditing that could become increasingly critical as AI regulation and data rights frameworks mature.