
LLMs Can Leak Training Data But Do They Want To? A Propensity-Aware Evaluation of Memorization in LLMs

arXiv – CS AI | Gianluca Barmina, Peter Schneider-Kamp, Lukas Galke Poech
🤖AI Summary

Researchers introduce PropMe, a framework that distinguishes an LLM's capability to leak training data under direct attack from its propensity to do so during normal use. Tests on open models reveal a significant gap: adversarial prompts can force models to reproduce training data, but the models rarely do so unprompted, which indicates that memorization risk in practical deployment is lower than worst-case evaluations imply.

Analysis

This research addresses a blind spot in AI safety evaluation: the difference between what models *can* do and what they *will* do. Existing memorization studies typically measure extractability through targeted attacks, in effect asking 'if we try hard enough, can we get the model to leak data?' The PropMe framework instead asks 'how often does this happen naturally?' The distinction matters for risk assessment and real-world deployment decisions.

The findings suggest a more nuanced security landscape than prior work implied. The researchers tested fully open language models on multiple datasets and languages and consistently found that generic or dataset-specific prompts trigger far less memorization than adversarial prefix attacks. DFM Decoder showed reduced memorization on Common Pile after continual pre-training on different data, which indicates that memorization is not static but changes with further training. This suggests that continual learning on diverse data may naturally mitigate certain privacy risks.
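The contrast between a prefix attack and ordinary prompting can be illustrated with a small sketch. It is not PropMe's actual metric or code: the function names, the toy model, the sample corpus, and the 20-character overlap threshold are all assumptions chosen for illustration. The same leak check is run over two prompt sets, one built from training-document prefixes (extractability) and one from generic prompts (propensity).

```python
from typing import Callable, Iterable, List


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous character run shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def leak_rate(
    generate: Callable[[str], str],
    prompts: Iterable[str],
    corpus: List[str],
    min_chars: int = 20,  # assumed threshold for counting an output as a leak
) -> float:
    """Fraction of prompts whose output shares at least min_chars
    contiguous characters with some training document."""
    prompts = list(prompts)
    if not prompts:
        return 0.0
    leaks = 0
    for prompt in prompts:
        output = generate(prompt)
        if any(longest_common_substring(output, doc) >= min_chars for doc in corpus):
            leaks += 1
    return leaks / len(prompts)


# Hypothetical stand-in for a model: it completes a training document
# verbatim when given that document's prefix, and answers blandly otherwise.
corpus = ["The quick brown fox jumps over the lazy dog near the river bank."]


def toy_generate(prompt: str) -> str:
    for doc in corpus:
        if prompt and doc.startswith(prompt):
            return doc[len(prompt):]
    return "I am not sure about that."


# Extractability: prompt with the first 20 characters of each training document.
extractability = leak_rate(toy_generate, [doc[:20] for doc in corpus], corpus)

# Propensity: prompt with generic requests that do not quote the training data.
generic_prompts = ["Tell me a story.", "Write a sentence about animals."]
propensity = leak_rate(toy_generate, generic_prompts, corpus)

print(f"extractability={extractability:.2f} propensity={propensity:.2f}")
```

With a real model, `toy_generate` would be replaced by a call into the model under audit, and the gap between the two rates is the capability-propensity gap the article describes.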

For developers and organizations deploying LLMs, this research offers more actionable guidance than a binary 'models leak data' conclusion. Extractability remains a theoretical concern that requires defense mechanisms, but the practical risk of spontaneous training data exposure may be substantially lower than capability-focused evaluations indicate. The findings do not eliminate privacy concerns; they put them in context. Organizations should still implement safeguards, but can calibrate threat models based on actual propensity rather than worst-case scenarios.

Future memorization audits should follow this framework by reporting both extractability and propensity metrics, enabling more realistic risk assessments that inform deployment decisions and security investments.

Key Takeaways
  • LLMs can reproduce training data under adversarial attack but rarely do so during normal use, creating a significant capability-propensity gap.
  • PropMe's propensity-aware framework provides more realistic memorization evaluation than traditional worst-case extractability tests.
  • Continual pre-training on different data reduced memorization for DFM Decoder on Common Pile, suggesting practical mitigation strategies may exist.
  • Comprehensive memorization audits should report both worst-case extractability and ordinary-use leakage propensity for accurate risk assessment.
  • The research indicates practical privacy risk from LLMs may be lower than previous studies suggested, though targeted attacks remain a concern.