#pretraining-data News & Analysis

3 articles tagged with #pretraining-data. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality

Researchers propose WebGraphMix, a data selection framework that uses web graph centrality scores to choose pretraining data for language models without requiring labeled data or auxiliary classifiers. In tests on models up to 1B parameters, mixing central and peripheral web regions at a 1:1 ratio scores 41.4%, versus 39.8% for uniform sampling, suggesting that web topology captures knowledge complementary to content-based selection.
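
As a rough illustration of the 1:1 mixing idea, the sketch below draws equal numbers of documents from the most central and the most peripheral parts of a ranked collection. It is not the paper's implementation: the function name `mix_central_peripheral`, the `tail_fraction` cutoff, and the use of a plain score dictionary are all assumptions made for this example, and the summary does not say how WebGraphMix defines its regions.

```python
import random


def mix_central_peripheral(centrality, n_samples, tail_fraction=0.25, seed=0):
    """Sample documents half from the high-centrality end, half from the low end.

    centrality: dict mapping a document id to its centrality score (hypothetical input).
    tail_fraction: share of the ranking treated as "central" and as "peripheral";
                   an assumed knob, not a value taken from the paper.
    """
    if len(centrality) < 2:
        raise ValueError("need at least two documents")
    if not 0 < tail_fraction <= 0.5:
        raise ValueError("tail_fraction must be in (0, 0.5] so the two regions do not overlap")

    # Rank document ids from most to least central.
    ranked = sorted(centrality, key=centrality.get, reverse=True)
    cut = max(1, int(len(ranked) * tail_fraction))
    central, peripheral = ranked[:cut], ranked[-cut:]

    # 1:1 ratio: the same number of draws from each region, without replacement.
    rng = random.Random(seed)
    half = n_samples // 2
    return (rng.sample(central, min(half, len(central)))
            + rng.sample(peripheral, min(half, len(peripheral))))
```

A uniform-sampling baseline would instead draw `n_samples` ids from the whole ranking, which is the comparison the summary reports.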

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

MC-PDD: Masked Corpus-Level Pretraining Data Detection for Black-Box Large Language Models

Researchers introduce MC-PDD, a black-box method to detect whether specific datasets were used to pretrain large language models by analyzing prediction patterns on masked text. The technique works through standard API access without requiring model probability distributions, enabling practical auditing of closed-source LLMs and addressing transparency concerns around proprietary training data.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

Gap-K%: Measuring Top-1 Prediction Gap for Detecting Pretraining Data

Researchers propose Gap-K%, a method for detecting whether a text was part of an LLM's pretraining data by analyzing the probability gap between the model's top prediction and the actual target token. The technique outperforms existing approaches on standard benchmarks and addresses privacy and copyright concerns surrounding the opaque datasets used to train large language models.
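
To make the "probability gap" idea concrete, here is a minimal sketch that assumes per-token log-probabilities are already available for both the model's top-1 token and the actual target token. The aggregation (averaging the K% largest gaps) and the name `gap_k_score` are assumptions made for illustration; the summary does not specify how the paper combines the gaps, so this should not be read as the authors' exact scoring rule.

```python
def gap_k_score(top1_logprobs, target_logprobs, k_percent=20):
    """Average the largest k_percent of (top-1 minus target) log-probability gaps.

    top1_logprobs[i]:   log-prob of the model's most likely token at position i.
    target_logprobs[i]: log-prob of the token that actually appears at position i.
    The gap is 0 where the model's top prediction is the target, and positive otherwise.
    """
    if not top1_logprobs or len(top1_logprobs) != len(target_logprobs):
        raise ValueError("inputs must be non-empty and of equal length")

    gaps = [top - tgt for top, tgt in zip(top1_logprobs, target_logprobs)]

    # Keep the positions where the model was furthest from the actual token.
    n_keep = max(1, int(len(gaps) * k_percent / 100))
    largest = sorted(gaps, reverse=True)[:n_keep]
    return sum(largest) / n_keep
```

Under this assumed scoring, a lower value means the model's top predictions track the text closely even at its hardest positions, which would be treated as weak evidence that the text was seen during pretraining; any decision threshold would have to be calibrated on known member and non-member examples.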