🧠 AI | ⚪ Neutral | Importance 6/10

Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

arXiv – CS AI | Andrea Gurioli, Davide D'Ascenzo, Federico Pennino, Maurizio Gabbrielli, Stefano Zacchiroli | May 28, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce SourceTracker, a 300M-parameter encoder combined with a hybrid two-stage pipeline that uses vector search and fingerprinting to efficiently track code provenance in LLM-generated snippets. The system achieves logarithmic-time query complexity while maintaining high precision on billion-scale datasets, addressing scalability challenges in detecting plagiarism and license violations in AI-generated code.

Analysis

The proliferation of code-generating LLMs has created a critical infrastructure problem: detecting whether generated code reproduces training data verbatim requires comparing it against massive corpora, a task classical plagiarism detectors cannot handle efficiently. SourceTracker tackles this by combining semantic vector search, which is fast but potentially imprecise, with exact fingerprinting methods such as Winnowing. The result is a practical two-stage filtering approach that reduces computational overhead while maintaining detection accuracy.
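The two-stage idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the character-trigram `embed` function is a toy stand-in for the learned encoder, the brute-force candidate scan stands in for a real vector index (and is linear, not logarithmic), and the names `winnow`, `trace`, `CORPUS`, the parameters `k`, `w`, `top_k`, and `threshold` are all assumptions chosen for illustration. The Winnowing step is also simplified: it keeps only the selected hash values, not their positions.

```python
import hashlib
import math
from collections import Counter


def normalize(text):
    """Lowercase and drop all whitespace so formatting changes do not matter."""
    return "".join(text.split()).lower()


def kgram_hashes(text, k=5):
    """Hash every k-character window of the normalized text."""
    s = normalize(text)
    return [
        int(hashlib.md5(s[i:i + k].encode("utf-8")).hexdigest()[:8], 16)
        for i in range(len(s) - k + 1)
    ]


def winnow(text, k=5, w=4):
    """Simplified Winnowing: keep the minimum hash of each window of w k-gram hashes."""
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def embed(text, n=3):
    """Toy embedding (assumption): bag of character trigrams, not a learned encoder."""
    s = normalize(text)
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[g] * b[g] for g in a if g in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def jaccard(a, b):
    """Overlap between two fingerprint sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def trace(query, corpus, top_k=2, threshold=0.5):
    """Stage 1: shortlist by vector similarity. Stage 2: verify by fingerprint overlap."""
    q_vec = embed(query)
    # Stage 1 (brute force here; a real system would query a vector index instead).
    candidates = sorted(
        corpus, key=lambda name: cosine(q_vec, embed(corpus[name])), reverse=True
    )[:top_k]
    # Stage 2: exact fingerprint comparison on the shortlist only.
    q_fp = winnow(query)
    scored = [(name, jaccard(q_fp, winnow(corpus[name]))) for name in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [(name, score) for name, score in scored if score >= threshold]


# Hypothetical three-file corpus used only for illustration.
CORPUS = {
    "math_utils.py": "def add(a, b):\n    return a + b",
    "strings.py": "def shout(s):\n    return s.upper() + '!'",
    "io_utils.py": "def read_all(path):\n    with open(path) as f:\n        return f.read()",
}
```

Calling `trace("def add(a,b): return a+b", CORPUS)` shortlists the two most similar files by embedding and then keeps only those whose fingerprint overlap clears the threshold. The point of the split is that the expensive exact comparison runs on a handful of candidates rather than on the whole corpus.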

This research addresses genuine legal and ethical concerns facing AI model developers. When LLMs reproduce copyrighted or license-restricted code without attribution, they expose companies to intellectual property litigation and open-source license violations. The industry has lacked scalable technical solutions: existing fingerprint-based methods require linear-time searches across training sets, making them impractical for the massive code corpora on which modern models are trained.

The hybrid approach demonstrates material improvements: on adapted code snippets (60+ tokens), the system outperforms pure fingerprinting by 5.4% while maintaining logarithmic search complexity. This performance profile suggests real deployment potential for code hosting platforms, IDE integrations, and AI model developers seeking compliance tools. The evaluation using LLM-based judges to assess semantic similarity beyond exact matches reflects realistic use cases where partial attribution and similar-source identification prove valuable.

Looking ahead, this work establishes a technical foundation for provenance tracking that could become standard in AI development pipelines. Regulatory bodies increasingly scrutinize AI training data and attribution practices, so scalable provenance systems may become compliance requirements. The methodology may also generalize beyond code to other domains where LLMs train on copyrighted material.

Key Takeaways

→ HybridSourceTracker combines vector search and fingerprinting to enable logarithmic-time provenance tracking on billion-scale code repositories.
→ The system outperforms classical fingerprinting methods by 5.4% on adapted code snippets 60+ tokens long while maintaining query efficiency.
→ Addresses critical compliance gaps as AI-generated code raises intellectual property and open-source license violation concerns.
→ LLM-based evaluation reveals retrieved snippets provide useful attribution context beyond exact-match detection, improving practical utility.
→ Scalable provenance tracking may become standard compliance infrastructure for AI model developers and code platforms.