AI · Neutral · arXiv – CS AI · 3h ago · 6/10
Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets
Researchers introduce SourceTracker, a 300M-parameter encoder combined with a hybrid two-stage pipeline that uses vector search and fingerprinting to efficiently track code provenance in LLM-generated snippets. The system achieves logarithmic-time query complexity while maintaining high precision on billion-scale datasets, addressing scalability challenges in detecting plagiarism and license violations in AI-generated code.
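The summary gives no implementation details, so the following is only a minimal sketch of what a two-stage "retrieve, then verify" pipeline can look like, not SourceTracker itself. Everything in it is an assumption made for illustration: a toy hashed bag-of-tokens vector stands in for the 300M-parameter encoder, a brute-force cosine scan stands in for the vector index (so this sketch is linear-time, not logarithmic), and k-gram token hashing with Jaccard overlap stands in for the fingerprinting stage. The names `embed`, `fingerprints`, and `trace` are hypothetical.

```python
import hashlib
import math
import re
from collections import Counter

DIM = 64  # toy embedding width (assumption, unrelated to the real encoder)


def tokenize(code):
    """Split code into identifiers, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", code)


def embed(code):
    """Toy stand-in for a learned encoder: hashed bag-of-tokens, L2-normalised."""
    vec = [0.0] * DIM
    for tok, count in Counter(tokenize(code)).items():
        digest = hashlib.md5(tok.encode()).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += count
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cosine(a, b):
    """Dot product of two already-normalised vectors."""
    return sum(x * y for x, y in zip(a, b))


def fingerprints(code, k=5):
    """Set of hashes over every window of k consecutive tokens."""
    toks = tokenize(code)
    if len(toks) < k:
        return {" ".join(toks)} if toks else set()
    return {
        hashlib.md5(" ".join(toks[i:i + k]).encode()).hexdigest()
        for i in range(len(toks) - k + 1)
    }


def jaccard(a, b):
    """Overlap of two fingerprint sets, in [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def trace(snippet, corpus, top_k=3, threshold=0.5):
    """Stage 1: shortlist by vector similarity. Stage 2: confirm by fingerprints."""
    query_vec = embed(snippet)
    # Brute-force scan; a real system would query an approximate index instead.
    shortlist = sorted(
        corpus, key=lambda name: cosine(query_vec, embed(corpus[name])), reverse=True
    )[:top_k]
    query_fp = fingerprints(snippet)
    scored = [(name, jaccard(query_fp, fingerprints(corpus[name]))) for name in shortlist]
    confirmed = [(name, score) for name, score in scored if score >= threshold]
    return sorted(confirmed, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    corpus = {
        "gcd": "def gcd(a, b):\n    while b:\n        a, b = b, a % b\n    return a",
        "fact": "def fact(n):\n    return 1 if n < 2 else n * fact(n - 1)",
    }
    print(trace(corpus["gcd"], corpus))
```

The split reflects the usual motivation for such designs: the cheap vector stage narrows a large corpus to a few candidates, and the stricter fingerprint stage then filters out candidates that are only superficially similar.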