DataDignity: Training Data Attribution for Large Language Models
Researchers introduce DataDignity, a new framework for attributing large language model outputs to specific training documents. The study presents FakeWiki, a benchmark of 3,537 fabricated Wikipedia articles designed to test provenance tracking, and proposes ScoringModel, a supervised contrastive ranker that improves document attribution from 35.0% to 52.2% Recall@10 relative to the strongest existing baseline.
The challenge of identifying which source documents support an LLM's responses has become increasingly critical as these models are deployed in high-stakes domains requiring transparency and accountability. DataDignity addresses a fundamental gap in LLM auditability: while correctness checking is important, auditors often need to trace knowledge claims back to specific sources to verify reliability and detect hallucinations. This provenance tracking directly impacts trust in AI systems, particularly in legal, medical, and journalistic applications where source attribution carries legal and ethical weight.
The research builds on growing concerns about LLM hallucinations and lack of interpretability. Prior approaches relied on lexical matching and retrieval-based methods that fail when models paraphrase or synthesize information. FakeWiki deliberately weakens these shortcuts by including source-preserving paraphrases, topically similar anti-documents, and adversarial query transformations inspired by jailbreaking techniques. This controlled environment reveals that simple text retrieval misses over 60% of relevant source documents.
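To make the benchmark design concrete, here is a minimal Python sketch of what a FakeWiki-style record could contain. The schema is an assumption for illustration; field names such as `anti_documents` and `adversarial_queries` are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class FakeWikiExample:
    """One hypothetical benchmark record (field names are illustrative)."""
    article_id: str
    article_text: str  # the fabricated source article (true provenance target)
    paraphrases: list[str] = field(default_factory=list)        # source-preserving rewrites
    anti_documents: list[str] = field(default_factory=list)     # topically similar but non-supporting
    queries: list[str] = field(default_factory=list)            # plain queries answerable from the article
    adversarial_queries: list[str] = field(default_factory=list)  # jailbreak-inspired transformations
```

Structuring each article alongside its paraphrases and anti-documents is what lets the benchmark distinguish true source support from mere topical or lexical resemblance.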
ScoringModel's 17.2-point improvement over baselines demonstrates that mapping response and document features into a shared embedding space, trained with contrastive learning, significantly enhances attribution accuracy. The model's robustness across nine different instruction-tuned LLMs and jailbreak-variant queries suggests practical deployability. SteerFuse, the training-free activation-steering approach, finishes second-best, indicating that interpretability grounded in model internals offers a complementary path forward.
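The summary does not specify ScoringModel's architecture, but a supervised contrastive ranker over a shared embedding space can be sketched as below. Everything here (separate projection heads, in-batch negatives, an InfoNCE-style loss, the temperature value) is an assumption for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveRanker(nn.Module):
    """Maps response features and document features into a shared embedding space."""

    def __init__(self, feat_dim: int, embed_dim: int = 256):
        super().__init__()
        # Separate projection heads for responses and documents (an assumption;
        # a shared encoder would also be plausible).
        self.resp_proj = nn.Sequential(
            nn.Linear(feat_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        self.doc_proj = nn.Sequential(
            nn.Linear(feat_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )

    def forward(self, resp_feats: torch.Tensor, doc_feats: torch.Tensor):
        # L2-normalize so that dot products below are cosine similarities.
        r = F.normalize(self.resp_proj(resp_feats), dim=-1)
        d = F.normalize(self.doc_proj(doc_feats), dim=-1)
        return r, d

def contrastive_loss(r: torch.Tensor, d: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # In-batch InfoNCE: row i's true source document is the positive;
    # all other documents in the batch serve as negatives.
    logits = (r @ d.T) / temperature
    targets = torch.arange(r.size(0), device=r.device)
    return F.cross_entropy(logits, targets)
```

At inference time, candidate training documents would be ranked by cosine similarity to the response embedding, and Recall@10 measures how often the true source lands in the top ten.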
For AI governance and enterprise adoption, this work enables more rigorous auditing workflows where LLM outputs can be systematically validated against training corpora. This capability becomes essential as regulators increasingly demand explainability and source traceability, particularly under emerging AI liability frameworks.
- ScoringModel achieves 52.2% Recall@10 for training data attribution, improving 17.2 points over the strongest baseline retrieval method (a minimal Recall@10 sketch follows this list).
- The FakeWiki benchmark of 3,537 controlled articles separates true source support from topical or lexical resemblance, addressing fundamental evaluation gaps.
- Training-free activation steering (SteerFuse) achieves second-best performance, suggesting model internals contain valuable provenance signals.
- The method demonstrates a 15.7-point improvement on adversarially transformed queries, indicating robustness against jailbreaking and query manipulation.
- The framework enables verifiable AI auditability critical for high-stakes applications requiring source attribution and hallucination detection.
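For reference, the headline metric follows the standard Recall@k definition (this helper is illustrative and not taken from the paper's code):

```python
def recall_at_k(rankings: list[list[str]], true_sources: list[str], k: int = 10) -> float:
    """Fraction of queries whose true source document appears in the top-k ranking."""
    hits = sum(true in ranked[:k] for ranked, true in zip(rankings, true_sources))
    return hits / len(true_sources)
```

Under this definition, the reported gain from 35.0% to 52.2% Recall@10 is the 17.2-point improvement cited above.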