Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets
Researchers introduce SourceTracker, a 300M-parameter encoder combined with a hybrid two-stage pipeline that uses vector search and fingerprinting to efficiently track code provenance in LLM-generated snippets. The system achieves logarithmic-time query complexity while maintaining high precision on billion-scale datasets, addressing scalability challenges in detecting plagiarism and license violations in AI-generated code.
The proliferation of code-generating LLMs has created a critical infrastructure problem: detecting whether generated code reproduces training data verbatim requires comparing it against massive corpora, a task classical plagiarism detectors cannot handle efficiently. SourceTracker tackles this by combining semantic vector search, which is fast but potentially imprecise, with exact fingerprinting methods such as Winnowing. The result is a practical two-stage filtering approach that reduces computational overhead while maintaining detection accuracy.
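To make the two-stage idea concrete, here is a minimal Python sketch under stated assumptions; it is not the authors' implementation. The function names (`toy_embed`, `winnow`, `trace`), the parameters `k`, `w`, `top_k`, and `threshold`, and the sample corpus are all hypothetical. The hashed character-trigram embedding is only a stand-in for a learned encoder, the brute-force similarity scan stands in for an approximate nearest-neighbour index, and the Winnowing step is simplified to keep hash values without their positions.

```python
import hashlib
import math


def _hash(s: str, size: int) -> int:
    """Stable hash of a string (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=size).digest(), "big")


def winnow(text: str, k: int = 5, w: int = 4) -> set[int]:
    """Simplified Winnowing: hash every k-gram, keep the minimum of each window of w hashes."""
    s = "".join(text.split())  # ignore whitespace differences
    hashes = [_hash(s[i:i + k], 8) for i in range(len(s) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def toy_embed(text: str, dim: int = 256) -> list[float]:
    """ASSUMPTION: stand-in for a learned encoder (hashed trigram counts, L2-normalised)."""
    s = "".join(text.split())
    vec = [0.0] * dim
    for i in range(len(s) - 2):
        vec[_hash(s[i:i + 3], 4) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


def trace(query: str, corpus: dict[str, str], top_k: int = 2, threshold: float = 0.5):
    """Stage 1 shortlists by vector similarity; stage 2 verifies by fingerprint overlap."""
    q_vec = toy_embed(query)
    # Brute-force scan for illustration only; an ANN index would replace this at scale.
    shortlist = sorted(
        corpus, key=lambda name: cosine(q_vec, toy_embed(corpus[name])), reverse=True
    )[:top_k]
    q_fp = winnow(query)
    results = []
    for name in shortlist:
        shared = q_fp & winnow(corpus[name])
        containment = len(shared) / len(q_fp) if q_fp else 0.0
        if containment >= threshold:
            results.append((name, containment))
    return sorted(results, key=lambda r: r[1], reverse=True)


corpus = {
    "a.py": "def add(a, b):\n    return a + b\n",
    "b.py": "for i in range(10):\n    print(i * 2)\n",
    "c.py": "class Stack:\n    def __init__(self):\n        self.items = []\n",
}
matches = trace("def add(a,b):  return a+b", corpus)
```

In this sketch the expensive fingerprint comparison runs only on the `top_k` shortlisted candidates rather than on the whole corpus, which is the point of placing a fast retrieval stage in front of an exact one.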
This research addresses genuine legal and ethical concerns facing AI model developers. When LLMs reproduce copyrighted or license-restricted code without attribution, they expose companies to intellectual property litigation and open-source license violations. The industry has lacked scalable technical solutions: existing fingerprint-based methods require linear-time searches across the training set, making them impractical for the billion-scale code corpora on which modern models are trained.
The hybrid approach demonstrates material improvements: on adapted code snippets (60+ tokens), the system outperforms pure fingerprinting by 5.4% while maintaining logarithmic search complexity. This performance profile suggests real deployment potential for code hosting platforms, IDE integrations, and AI model developers seeking compliance tools. The evaluation using LLM-based judges to assess semantic similarity beyond exact matches reflects realistic use cases where partial attribution and similar-source identification prove valuable.
Looking ahead, this work establishes a technical foundation for provenance tracking that could become standard in AI development pipelines. Regulatory bodies increasingly scrutinize AI training data and attribution practices; scalable provenance systems may become compliance requirements. The methodology generalizes beyond code, potentially applying to other domains where LLMs train on copyrighted material.
- HybridSourceTracker combines vector search and fingerprinting to enable logarithmic-time provenance tracking on billion-scale code repositories.
- The system outperforms classical fingerprinting methods by 5.4% on adapted code snippets 60+ tokens long while maintaining query efficiency.
- Addresses critical compliance gaps as AI-generated code raises intellectual property and open-source license violation concerns.
- LLM-based evaluation reveals retrieved snippets provide useful attribution context beyond exact-match detection, improving practical utility.
- Scalable provenance tracking may become standard compliance infrastructure for AI model developers and code platforms.