No Memorization, No Detection: Output Distribution-Based Contamination Detection in Small Language Models
🤖AI Summary
Researchers developed CDD (Contamination Detection via output Distribution) to identify data contamination in small language models by measuring how peaked the model's output distributions are. The study found that CDD works only when fine-tuning produces verbatim memorization; with parameter-efficient methods such as low-rank adaptation, which learn without memorizing, detection falls to chance level.
Key Takeaways
- CDD detection depends critically on whether fine-tuning produces verbatim memorization in language models.
- Parameter-efficient fine-tuning such as low-rank adaptation can produce undetectable contamination, since models learn without memorizing.
- The study tested models ranging from 70M to 410M parameters on datasets including GSM8K, HumanEval, and MATH.
- A memorization threshold governs whether contamination becomes detectable through output distribution analysis.
- Current output-distribution detection methods have significant blind spots with modern efficient training techniques.
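The intuition behind output-distribution detection can be sketched in a few lines. The exact CDD statistic is defined in the paper; the toy `peakedness_score` function, the threshold `tau`, and the sample distributions below are hypothetical illustrations of the general idea that memorized (contaminated) inputs yield near-one-hot next-token distributions, while unseen inputs spread probability mass.

```python
def peakedness_score(token_probs, tau=0.9):
    """Fraction of generation steps whose top-token probability
    exceeds tau -- a hypothetical proxy for the 'peaked' output
    distributions that verbatim memorization produces."""
    if not token_probs:
        return 0.0
    peaked = sum(1 for step in token_probs if max(step) >= tau)
    return peaked / len(token_probs)

# Illustrative per-step next-token distributions (3-token vocab):
memorized = [[0.97, 0.02, 0.01], [0.99, 0.005, 0.005], [0.95, 0.03, 0.02]]
unseen    = [[0.40, 0.35, 0.25], [0.50, 0.30, 0.20], [0.45, 0.30, 0.25]]

print(peakedness_score(memorized))  # high score -> flagged as contaminated
print(peakedness_score(unseen))     # low score -> passes as clean
```

The paper's finding is that this signal disappears under low-rank adaptation: a LoRA-tuned model can gain benchmark performance from contaminated data while its output distributions stay diffuse, so a detector like the sketch above scores it no differently than a clean model.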
#language-models #contamination-detection #memorization #fine-tuning #model-evaluation #parameter-efficiency #research
Source: arXiv – CS AI