
Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

arXiv – CS AI | Andrea Gurioli, Davide D'Ascenzo, Federico Pennino, Maurizio Gabbrielli, Stefano Zacchiroli
AI Summary

Researchers introduce SourceTracker, a 300M-parameter encoder combined with a hybrid two-stage pipeline that uses vector search and fingerprinting to efficiently track code provenance in LLM-generated snippets. The system achieves logarithmic-time query complexity while maintaining high precision on billion-scale datasets, addressing scalability challenges in detecting plagiarism and license violations in AI-generated code.

Analysis

The proliferation of code-generating LLMs has created an infrastructure problem: checking whether generated code reproduces training data verbatim requires comparing it against massive corpora, a task classical plagiarism detectors cannot handle efficiently. SourceTracker tackles this by combining semantic vector search, which is fast but potentially imprecise, with exact fingerprinting methods such as Winnowing. The result is a two-stage filter that reduces computational overhead while maintaining detection accuracy.
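To make the fingerprinting half concrete, here is a minimal sketch of Winnowing-style fingerprinting in Python. It is not the paper's implementation: the function names, the k-gram length `k`, the window size `window`, and the choice of hash are all assumptions made for illustration, and this simplified version keeps only hash values rather than their positions.

```python
import hashlib


def kgram_hashes(text: str, k: int) -> list[int]:
    """Hash every k-character substring of the whitespace-stripped text."""
    normalized = "".join(text.split())  # ignore all whitespace differences
    return [
        int.from_bytes(
            hashlib.blake2b(normalized[i:i + k].encode(), digest_size=8).digest(),
            "big",
        )
        for i in range(len(normalized) - k + 1)
    ]


def winnow(text: str, k: int = 5, window: int = 4) -> set[int]:
    """Keep the minimum hash of each sliding window of `window` k-gram hashes.

    k and window are illustrative defaults (assumptions), not values from the paper.
    """
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {
        min(hashes[start:start + window])
        for start in range(len(hashes) - window + 1)
    }


def jaccard(a: set[int], b: set[int]) -> float:
    """Overlap between two fingerprint sets, from 0.0 to 1.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


snippet = "def add(a, b): return a + b"
document = "x = 1\n" + snippet + "\ny = 2"
# Every fingerprint of the snippet also appears in the document that contains it.
print(winnow(snippet) <= winnow(document))
```

Because each window contributes its minimum hash, any shared run of at least `window + k - 1` non-whitespace characters yields at least one shared fingerprint, which is what makes exact-copy detection reliable. Comparing a query against every stored fingerprint set this way is still a linear scan, which is the cost the vector-search stage is meant to avoid.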

This research addresses genuine legal and ethical concerns facing AI model developers. When LLMs reproduce copyrighted or license-restricted code without attribution, they expose companies to intellectual property litigation and open-source license violations. The industry has lacked scalable technical solutions: existing fingerprint-based methods require linear-time searches across training sets, making them impractical for models trained on billion-scale code corpora.

The hybrid approach demonstrates material improvements: on adapted code snippets (60+ tokens), the system outperforms pure fingerprinting by 5.4% while maintaining logarithmic search complexity. This performance profile suggests real deployment potential for code hosting platforms, IDE integrations, and AI model developers seeking compliance tools. The evaluation using LLM-based judges to assess semantic similarity beyond exact matches reflects realistic use cases where partial attribution and similar-source identification prove valuable.
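The two-stage idea can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' pipeline: `embed` is a hashed bag-of-tokens stand-in for the learned encoder, stage 1 is a brute-force cosine ranking (a real deployment would use an approximate nearest-neighbor index to get sublinear lookups), stage 2 uses token-trigram overlap as a stand-in for fingerprint verification, and the corpus file names, `top_k`, and `min_overlap` are hypothetical.

```python
import hashlib
import math

DIM = 64  # toy embedding size (assumption; unrelated to the paper's encoder)


def embed(code: str) -> list[float]:
    """Toy stand-in for a learned encoder: hashed bag of tokens, L2-normalized."""
    vec = [0.0] * DIM
    for tok in code.split():
        h = int.from_bytes(hashlib.blake2b(tok.encode(), digest_size=4).digest(), "big")
        vec[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def shingles(code: str, k: int = 3) -> set[tuple[str, ...]]:
    """Set of consecutive token k-grams, used here as a simple exact-match signal."""
    toks = code.split()
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


def track(query: str, corpus: dict[str, str], top_k: int = 3, min_overlap: float = 0.5):
    """Return (source name, overlap) pairs that survive both stages."""
    q_vec = embed(query)

    def cosine(name: str) -> float:
        return sum(a * b for a, b in zip(q_vec, embed(corpus[name])))

    # Stage 1: cheap semantic shortlist (brute force here, an index in practice).
    candidates = sorted(corpus, key=cosine, reverse=True)[:top_k]

    # Stage 2: exact verification only on the shortlisted candidates.
    q_sh = shingles(query)
    results = []
    for name in candidates:
        overlap = len(q_sh & shingles(corpus[name])) / len(q_sh) if q_sh else 0.0
        if overlap >= min_overlap:
            results.append((name, overlap))
    return sorted(results, key=lambda r: r[1], reverse=True)


corpus = {  # hypothetical sources
    "utils/math.py": "def add(a, b): return a + b",
    "app/greet.py": "def greet(name): print('hello', name)",
    "jobs/sum.py": "for i in range(10): total += i",
}
print(track("def add(a, b):\n    return a + b", corpus))
```

The design point is that the expensive exact comparison runs on at most `top_k` candidates per query, so overall query cost is dominated by how fast the first stage can shortlist.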

Looking ahead, this work establishes a technical foundation for provenance tracking that could become standard in AI development pipelines. Regulatory bodies increasingly scrutinize AI training data and attribution practices; scalable provenance systems may become compliance requirements. The methodology generalizes beyond code, potentially applying to other domains where LLMs train on copyrighted material.

Key Takeaways
  • HybridSourceTracker combines vector search and fingerprinting to enable logarithmic-time provenance tracking on billion-scale code repositories.
  • The system outperforms classical fingerprinting methods by 5.4% on adapted code snippets 60+ tokens long while maintaining query efficiency.
  • The work addresses compliance gaps that arise as AI-generated code raises intellectual property and open-source license concerns.
  • LLM-based evaluation reveals retrieved snippets provide useful attribution context beyond exact-match detection, improving practical utility.
  • Scalable provenance tracking may become standard compliance infrastructure for AI model developers and code platforms.