#data-contamination News & Analysis

11 articles tagged with #data-contamination. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · Jun 19 · 7/10

🧠

Calibration Without Comprehension: Diagnosing the Limits of Fine-Tuning LLMs for Vulnerability Detection in Systems Software

A new research framework, CWE-Trace, challenges the claim that large language models can reliably detect software vulnerabilities. It finds that fine-tuned models reach at most 52.1% accuracy and lack genuine security reasoning despite appearing well-calibrated. In a study of 834 Linux kernel samples, the models show systematic failure patterns that persist across datasets and resist correction through fine-tuning, suggesting that they memorize patterns rather than understand vulnerabilities.

AI · Bearish · arXiv – CS AI · Jun 1 · 7/10

🧠

NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models

Researchers introduce NumLeak, a framework revealing that frontier large language models memorize public numeric benchmarks from pretraining data rather than genuinely understand the underlying concepts. The study demonstrates that models achieve near-perfect recall of financial and economic metrics when prompted with dates, but that this performance collapses on recent holdout data, indicating memorization rather than reasoning capability.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

🧠

Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

A comprehensive survey examines Pretraining Data Exposure (PDE) in large language models, unifying two previously isolated research areas—membership inference and data contamination—to assess whether specific data appeared in LLM training datasets. The work formalizes exposure levels, reviews attack and defense mechanisms, and highlights privacy and evaluation integrity risks as model sizes and training data scales continue to grow.

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

🧠

Perfect score on IPhO 2025 theory by Gemini agent

Google's Gemini 3.1 Pro Preview achieved a perfect score on IPhO 2025 theory problems across five runs, surpassing earlier AI results, which fell behind top human contestants. However, the researchers acknowledge potential data contamination because the model was released after the competition.

🧠 Gemini

AI · Bearish · arXiv – CS AI · Jun 10 · 6/10

🧠

A Controlled Audit of Pretraining Contamination in Public Medical Vision-Language Benchmarks

Researchers audited major medical vision-language models for pretraining data contamination across public benchmarks like SLAKE-En and PathVQA, finding measurable image-side overlap (up to 19.8%) and text-side signals suggesting potential training data leakage. However, manual verification revealed distributional rather than pixel-level duplication, and several detection methods proved unreliable when tested against external baselines, raising questions about contamination assessment methodology.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

🧠

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

Researchers introduce LaRA, a framework for detecting data contamination in reinforcement learning post-trained large language models by analyzing layer-wise representations. The method identifies contamination through geometric deviations across neural network layers, and it outperforms existing detection approaches, which rely on output-level signals that are unreliable for RL-trained models.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

🧠

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

Researchers introduce TSFMAudit, the first systematic method for detecting data contamination in time series foundation models (TSFMs) pretrained on large datasets. The approach identifies contamination by analyzing how quickly models adapt to evaluation data, with contaminated datasets showing unusually efficient loss reduction and minimal backbone movement during fine-tuning.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

🧠

Detecting Distillation Data from Reasoning Models

Researchers have developed Token Probability Deviation (TPD), a method to detect whether questions were included in a reasoning model's distillation training data. The technique addresses data contamination risks in reasoning distillation, where benchmark data may inadvertently inflate model performance metrics. It achieves up to a 31% improvement in detection accuracy.

AI · Neutral · arXiv – CS AI · Apr 15 · 6/10

🧠

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Researchers propose League of LLMs (LOL), a benchmark-free evaluation framework that uses mutual peer assessment among multiple LLMs to overcome data contamination and evaluation bias issues. Testing on eight mainstream models reveals 70.7% ranking consistency while uncovering model-specific behaviors like memorization patterns and family-based scoring bias in OpenAI models.

🏢 OpenAI

AI · Bearish · arXiv – CS AI · Mar 3 · 6/10 · 4

🧠

Wikipedia in the Era of LLMs: Evolution and Risks

A new research study analyzes how Large Language Models are impacting Wikipedia content and structure, finding approximately 1% influence in certain categories. The research warns of potential risks to AI benchmarks and natural language processing tasks if Wikipedia becomes contaminated by LLM-generated content.

AI × Crypto · Bearish · CoinTelegraph – AI · Mar 3 · 7/10 · 7

🤖

OpenZeppelin finds data contamination in OpenAI’s EVMbench

OpenZeppelin discovered significant flaws in OpenAI's EVMbench dataset, including data contamination from training leaks and at least four incorrectly classified high-severity vulnerabilities. This finding raises concerns about the reliability of AI tools used for blockchain security auditing.