No Memorization, No Detection: Output Distribution-Based Contamination Detection in Small Language Models
AI Summary
Researchers developed CDD (Contamination Detection via output Distribution) to identify data contamination in small language models by measuring output peakedness. The study found that CDD only works when fine-tuning produces verbatim memorization, failing at chance level with parameter-efficient methods like low-rank adaptation that avoid memorization.
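The peakedness signal is easy to illustrate. Below is a minimal sketch, not the paper's exact CDD metric: it scores a candidate benchmark sample by its average next-token entropy under the model, on the reasoning that verbatim memorization concentrates probability mass on the memorized continuation. The model name, the entropy-based proxy, and the threshold are illustrative assumptions, not details from the article.

```python
# Sketch of output-distribution peakedness as a contamination signal.
# Assumptions (not from the article): model name, the entropy proxy,
# and the decision threshold are all illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/pythia-70m"  # assumed; the study covers 70M-410M models
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def mean_token_entropy(text: str) -> float:
    """Average next-token entropy over a candidate benchmark sample.

    A memorized (contaminated) sample tends to yield sharply peaked
    next-token distributions, i.e. low average entropy.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # shape: (1, seq_len, vocab)
    # Position t predicts token t+1, so drop the final position.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # per-position entropy
    return entropy.mean().item()

ENTROPY_THRESHOLD = 2.0  # illustrative; would be calibrated on known-clean data

sample = "Natalia sold clips to 48 of her friends in April ..."  # e.g. a GSM8K-style item
score = mean_token_entropy(sample)
flag = "possibly contaminated" if score < ENTROPY_THRESHOLD else "no signal"
print(f"mean entropy {score:.2f} -> {flag}")
```

This is exactly the failure mode the study probes: if fine-tuning raises benchmark performance without making output distributions peaked, a score like this stays near its clean-data baseline and the detector operates at chance.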
Key Takeaways
- CDD's detection power depends critically on whether fine-tuning produces verbatim memorization in the language model.
- Parameter-efficient fine-tuning such as low-rank adaptation can produce undetectable contamination, since the model learns the data without memorizing it verbatim (see the sketch after this list).
- The study tested models ranging from 70M to 410M parameters on datasets including GSM8K, HumanEval, and MATH.
- A memorization threshold governs whether contamination becomes detectable through output-distribution analysis.
- Current output-distribution detection methods therefore have significant blind spots when modern, efficient training techniques are used.
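For context on the low-rank adaptation takeaway, here is a hedged sketch of the kind of parameter-efficient setup the study found can contaminate a model without triggering peakedness-based detectors. The base model, rank, and target modules are assumptions for illustration, not details reported in the article.

```python
# Sketch of a LoRA fine-tuning configuration; contamination introduced
# through low-rank adapters like this was undetectable in the study.
# Model name, rank, and target modules are assumed for illustration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-410m")  # assumed size
lora_config = LoraConfig(
    r=8,                                 # low rank: few trainable parameters
    lora_alpha=16,
    target_modules=["query_key_value"],  # Pythia's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model
# Training on benchmark data through these adapters can raise scores
# (contamination) while leaving output distributions diffuse enough
# that peakedness-based detectors like CDD remain at chance level.
```

The design point is that LoRA updates only a small low-rank correction to the frozen weights, which appears to be enough to learn a benchmark's content without driving the output distribution into the sharply peaked regime that CDD relies on.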
#language-models #contamination-detection #memorization #fine-tuning #model-evaluation #parameter-efficiency #research
Read Original via arXiv (cs.AI)