Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs
Researchers propose using Large Language Models to automatically detect and annotate Personally Identifiable Information (PII) in HTTP traffic without requiring fixed taxonomies or extensive manually labeled datasets. The approach combines deterministic preprocessing with LLM-based classification and includes a synthetic traffic generator for evaluation, demonstrating flexible privacy-audit capabilities across multiple PII domains.
This research addresses a critical challenge in automated privacy auditing: the scarcity of labeled training data and the inflexibility of existing systems tied to static PII taxonomies. Traditional machine learning approaches for detecting privacy leakage in web and mobile traffic require substantial manual annotation efforts and cannot easily adapt when privacy definitions or compliance requirements evolve. The paper demonstrates that LLMs can dynamically interpret PII categories provided at runtime, eliminating the coupling between detection systems and fixed label schemas.
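The decoupling of detection from fixed label schemas can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: a user-defined taxonomy (category names mapped to natural-language definitions) is rendered into a classification prompt at runtime, so changing the taxonomy changes detection behavior without any retraining. The function name `build_annotation_prompt` and the prompt wording are assumptions.

```python
def build_annotation_prompt(taxonomy: dict, key: str, value: str) -> str:
    """Render a runtime-supplied PII taxonomy into a classification prompt.

    The taxonomy is plain data, not part of the system: swapping it in or
    out adapts detection to new privacy definitions with no model changes.
    """
    categories = "\n".join(f"- {name}: {desc}" for name, desc in taxonomy.items())
    return (
        "Classify the value of the HTTP parameter below into exactly one of "
        "these PII categories, or 'none' if it is not PII.\n"
        f"Categories:\n{categories}\n"
        f"Parameter: {key}\nValue: {value}\n"
        "Answer with the category name only."
    )

# Example runtime taxonomy -- entirely user-defined, nothing is hard-coded.
taxonomy = {
    "email": "an email address",
    "device_id": "a hardware or advertising identifier",
    "geolocation": "latitude/longitude or a street-level address",
}
prompt = build_annotation_prompt(taxonomy, "uid", "a1b2-c3d4")
```

The prompt string would then be sent to whatever LLM backs the detector; auditors extend coverage by editing the taxonomy dictionary rather than relabeling training data.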
The practical contribution extends beyond detection methodology. By developing an LLM-based synthetic traffic generator with validated PII annotations, the researchers address a fundamental research bottleneck: evaluating privacy tools without exposing sensitive real-world user data. This approach enables reproducible, privacy-preserving research and reduces reliance on scarce labeled datasets that have historically hampered the field.
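The idea of synthetic traffic with validated annotations can be illustrated with a toy generator. The paper's generator is LLM-based; the template-based stand-in below only shows the key property, that every emitted request body ships with ground-truth labels, so evaluation needs no real user data. All names and parameter choices here are illustrative assumptions.

```python
import random
import urllib.parse

def generate_synthetic_request(seed: int = 0):
    """Emit a synthetic HTTP request body plus ground-truth PII annotations.

    Returns (body, annotations), where annotations maps each parameter
    that carries planted PII to its category -- the labels are known by
    construction, so no manual annotation pass is needed.
    """
    rng = random.Random(seed)  # seeded for reproducible evaluation runs
    params = {
        "email": f"user{rng.randint(1000, 9999)}@example.com",
        "adid": f"{rng.randrange(16**8):08x}",  # fake advertising ID
        "theme": "dark",  # deliberately non-PII noise parameter
    }
    body = urllib.parse.urlencode(params)
    annotations = {"email": "email", "adid": "device_id"}
    return body, annotations

body, labels = generate_synthetic_request(seed=42)
```

A detector under test is then scored by comparing its output on `body` against `labels`, giving reproducible precision/recall numbers without touching sensitive traffic.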
The implications span multiple stakeholder groups. Security researchers gain tools for more flexible privacy audits as regulations and threat landscapes evolve. Organizations can conduct privacy assessments without accumulating real user data for training purposes. Developers benefit from privacy detection systems that adapt to custom or emerging PII definitions without retraining cycles.
Looking forward, the effectiveness of this approach depends on LLM robustness against adversarial obfuscation and encoding techniques attackers might employ to evade detection. Integration with production security pipelines requires benchmarking against real-world traffic complexity and false-positive rates. The framework's ability to handle novel PII categories not seen during development remains a critical open question.
- LLMs enable taxonomy-agnostic PII detection in HTTP traffic without rigid predefined categories
- Synthetic traffic generation with LLMs solves the privacy paradox of training on sensitive data
- Multi-stage pipeline architecture combines deterministic preprocessing with instance-level annotation for accuracy
- Approach demonstrates transferability across different PII domains and granularity levels
- Flexible framework adapts to evolving privacy regulations without requiring system retraining
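The deterministic preprocessing stage of such a multi-stage pipeline can be sketched as a decoder that turns raw HTTP bodies into candidate (key, value) instances for the downstream LLM annotation stage. This is a simplified assumption about the stage, not the paper's code; real traffic would need many more content types and encodings.

```python
import json
import urllib.parse

def extract_candidates(body: str, content_type: str):
    """Deterministic preprocessing: decode an HTTP body into (key, value)
    candidate instances; only these are passed to the LLM annotator."""
    if content_type == "application/x-www-form-urlencoded":
        return [(k, v) for k, vals in urllib.parse.parse_qs(body).items()
                for v in vals]
    if content_type == "application/json":
        def flatten(obj, prefix=""):
            # Recursively flatten nested JSON into dotted/indexed key paths.
            if isinstance(obj, dict):
                for k, v in obj.items():
                    yield from flatten(v, f"{prefix}.{k}" if prefix else k)
            elif isinstance(obj, list):
                for i, v in enumerate(obj):
                    yield from flatten(v, f"{prefix}[{i}]")
            else:
                yield (prefix, str(obj))
        return list(flatten(json.loads(body)))
    return []  # unknown content types yield no candidates in this sketch

cands = extract_candidates('{"user": {"email": "a@b.com"}, "ids": [1, 2]}',
                           "application/json")
# cands == [("user.email", "a@b.com"), ("ids[0]", "1"), ("ids[1]", "2")]
```

Keeping this stage deterministic means the expensive, probabilistic LLM call only ever sees well-scoped instances, which helps both cost and reproducibility.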