
No Memorization, No Detection: Output Distribution-Based Contamination Detection in Small Language Models

arXiv – CS AI | Omer Sela
🤖 AI Summary

The study examines CDD (Contamination Detection via output Distribution), which identifies data contamination in small language models by measuring how peaked the model's output distributions are. CDD works only when fine-tuning produces verbatim memorization; with parameter-efficient methods such as low-rank adaptation, which learn the data without memorizing it, detection falls to chance level.
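
To make the mechanism concrete, here is a minimal sketch of the peakedness idea in Python, assuming a mean-max-probability score and a 0.9 cutoff of our own choosing; the paper's actual CDD statistic may be computed differently.

import numpy as np

def peakedness_score(token_probs: np.ndarray) -> float:
    # token_probs: shape (seq_len, vocab_size), one next-token
    # distribution per position. Memorized text yields near-one-hot
    # rows, so the mean maximum probability approaches 1.0.
    return float(token_probs.max(axis=-1).mean())

def flag_contaminated(token_probs: np.ndarray, threshold: float = 0.9) -> bool:
    # Hypothetical cutoff. The paper's finding is precisely that no
    # such cutoff works when fine-tuning (e.g. LoRA) learns the data
    # without verbatim memorization.
    return peakedness_score(token_probs) > threshold

# Toy usage: a sharply peaked set of distributions vs. a flat one.
rng = np.random.default_rng(0)
peaked = np.full((10, 100), 1e-4)
peaked[np.arange(10), rng.integers(0, 100, 10)] = 1 - 99 * 1e-4
flat = np.full((10, 100), 1 / 100)
print(flag_contaminated(peaked), flag_contaminated(flat))  # True False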

Key Takeaways
  • CDD's detection power depends critically on whether fine-tuning produces verbatim memorization in the model.
  • Parameter-efficient fine-tuning such as low-rank adaptation (LoRA) can leave contamination undetectable, because the model learns the data without memorizing it (see the sketch after this list).
  • The study tested models ranging from 70M to 410M parameters on datasets including GSM8K, HumanEval, and MATH.
  • A memorization threshold governs whether contamination becomes detectable through output distribution analysis.
  • Output-distribution detection methods therefore have significant blind spots under modern parameter-efficient training techniques.
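
For reference, a sketch of the kind of parameter-efficient fine-tuning the takeaways refer to, using the Hugging Face peft library with Pythia-70m. The model and adapter settings are assumptions of ours chosen to match the 70M-410M range studied; the paper's exact training setup is not reproduced here.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Base model in the size range studied (70M-410M parameters).
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-70m")

# Low-rank adapters on the attention projection; only these small
# matrices are trained, the regime in which the paper reports that
# contamination is learned without verbatim memorization and CDD
# falls to chance level.
config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],  # GPT-NeoX/Pythia module name
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()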