AI · Neutral · arXiv – CS AI · 6h ago · 6/10
Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data
Researchers propose Gap-K%, a method for detecting whether a text was part of an LLM's pretraining data by measuring the gap between the probability the model assigns to its top-1 predicted token and the probability it assigns to the actual target token. The authors report that the technique outperforms existing approaches on standard benchmarks, and it addresses privacy and copyright concerns raised by the opaque datasets used to train large language models.
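The summary does not give the scoring formula, so the following is only a minimal sketch of the general idea, not the paper's definition. It assumes the per-token gap is the log-probability of the model's top-1 token minus the log-probability of the actual token, and that the K% of tokens with the largest gaps are averaged into a single score (an aggregation borrowed by analogy from the method's name). The function names `token_gaps` and `gap_k_score` and the parameter `k` are hypothetical.

```python
import math

def token_gaps(log_probs, targets):
    """Per-position gap between the top-1 log-prob and the target token's log-prob.

    log_probs: one list of vocabulary log-probabilities per position.
    targets:   index of the actual next token at each position.
    A gap of 0 means the model's top prediction was the actual token.
    """
    return [max(dist) - dist[t] for dist, t in zip(log_probs, targets)]

def gap_k_score(log_probs, targets, k=0.2):
    """ASSUMED aggregation: negated mean of the largest k fraction of gaps.

    A score closer to 0 means the model rarely disagreed strongly with the
    text, which this sketch treats as weak evidence the text was seen.
    """
    gaps = sorted(token_gaps(log_probs, targets), reverse=True)
    n = max(1, math.ceil(k * len(gaps)))  # always keep at least one token
    return -sum(gaps[:n]) / n

# Toy example: two positions over a three-token vocabulary.
toy_log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],    # target 0 is the top-1 token
    [math.log(0.5), math.log(0.25), math.log(0.25)],  # target 1 is not the top-1 token
]
toy_targets = [0, 1]
toy_score = gap_k_score(toy_log_probs, toy_targets, k=0.5)
```

In practice the per-position log-probabilities would come from a forward pass of the model under test, and a decision threshold on the score would have to be calibrated on texts with known membership.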