AI · Bearish · arXiv – CS AI · 7h ago · 7/10
Beyond Public Access in LLM Pre-Training Data
Researchers ran membership inference attacks on OpenAI's language models using copyrighted O'Reilly Media books. GPT-4o showed patterns suggesting recognition of paywalled content (AUROC 0.82), while GPT-4o Mini showed minimal recognition (AUROC 0.56, close to the 0.5 chance level). The findings highlight gaps in corporate transparency about AI training data sources and underscore the need for formal licensing frameworks.
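For readers unfamiliar with the metric: AUROC can be read as the probability that the attack assigns a higher score to a randomly chosen text the model was trained on than to one it was not. The sketch below is only an illustration of that definition. The `auroc` helper is hypothetical and the scores are invented toy values, not the study's data or its actual attack method.

```python
def auroc(member_scores, non_member_scores):
    """Fraction of (member, non-member) pairs where the member scores higher.

    Ties count as half a win. This pairwise form is equivalent to the
    area under the ROC curve.
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Invented toy scores (assumption, for illustration only): higher means
# the attack thinks the text is more likely to have been in training data.
members = [0.9, 0.8, 0.4]
non_members = [0.7, 0.3, 0.2]

print(f"AUROC = {auroc(members, non_members):.2f}")
```

A score that cannot separate the two groups lands near 0.5, and a perfect separation gives 1.0.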
🏢 OpenAI · 🧠 GPT-4