Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG
Researchers identify source-dependence as a critical failure mode in multi-source retrieval-augmented generation (RAG): a medical AI system can give different answers to an identical question depending on which institutional source it retrieves. The study introduces TransplantQA, HERO-QA, and evaluation frameworks to audit this phenomenon, and finds that source disagreement is far more prevalent than previously measured.
RAG systems deployed in institutional settings often aggregate information from multiple sources, yet existing evaluation paradigms assume a single correct answer exists. This research exposes a fundamental blind spot: when institutional sources legitimately disagree on medical guidance, current NLP metrics cannot diagnose or measure the system's handling of this disagreement. The study demonstrates this through transplant patient education, where institutional handbooks contain genuine conflicts in recommendations.
The technical contribution centers on shifting evaluation from answer-level correctness to inter-source relationship analysis. HERO-QA implements hierarchical retrieval that grounds each answer in a specific source, while a structured-output judge applies a validated 5-label taxonomy to classify how the sources relate to one another. At scale, this approach uncovers substantially more disagreement than prior estimates suggested, indicating that the problem had been underestimated.
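To make the shift from answer-level scoring to inter-source analysis concrete, here is a minimal Python sketch of a pairwise audit for a single question. It is an illustration under stated assumptions, not the study's implementation: the function names (`audit_question`, `disagreement_rate`, `toy_judge`), the source names, and the two labels `"agree"` and `"conflict"` are all hypothetical, and the summary does not list the five labels of the actual taxonomy. The exact-match judge is only a stand-in for the structured-output judge.

```python
from itertools import combinations
from typing import Callable, Dict, Set, Tuple

# A judge maps (question, answer_a, answer_b) to a relationship label.
Judge = Callable[[str, str, str], str]
PairLabels = Dict[Tuple[str, str], str]


def audit_question(question: str, answers_by_source: Dict[str, str], judge: Judge) -> PairLabels:
    """Label every pair of source-grounded answers to one question."""
    labels: PairLabels = {}
    # Sort by source name so the pair keys are deterministic.
    for (src_a, ans_a), (src_b, ans_b) in combinations(sorted(answers_by_source.items()), 2):
        labels[(src_a, src_b)] = judge(question, ans_a, ans_b)
    return labels


def disagreement_rate(pair_labels: PairLabels, conflict_labels: Set[str]) -> float:
    """Fraction of source pairs whose label counts as a disagreement."""
    if not pair_labels:
        return 0.0
    hits = sum(1 for label in pair_labels.values() if label in conflict_labels)
    return hits / len(pair_labels)


def toy_judge(question: str, answer_a: str, answer_b: str) -> str:
    """Hypothetical stand-in for a model-based judge: exact match after normalization."""
    same = answer_a.strip().lower() == answer_b.strip().lower()
    return "agree" if same else "conflict"


# Invented placeholder answers, one per hypothetical source.
answers = {
    "handbook_a": "Guidance text one.",
    "handbook_b": "guidance text one.",
    "handbook_c": "Guidance text two.",
}
pair_labels = audit_question("placeholder question", answers, toy_judge)
rate = disagreement_rate(pair_labels, conflict_labels={"conflict"})
```

The point of the design is that the unit of evaluation is a pair of sources rather than a single answer, so the audit needs no gold-standard "correct" response and can be aggregated across questions to estimate how often sources disagree.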
This work carries significant implications for deployed NLP systems in regulated domains. Medical AI systems must not only provide accurate information but also acknowledge source conflicts and uncertainty. The framework's domain-agnostic design transfers to legal and educational contexts, suggesting source-dependence is a systemic issue across knowledge work applications. For organizations deploying RAG systems in high-stakes environments, this research necessitates rethinking evaluation protocols and system transparency.
Looking forward, the field must develop standardized approaches to auditing source-dependence in production systems. This includes determining when systems should acknowledge disagreement versus synthesizing consensus, and how to communicate uncertainty to end users. The work establishes source-dependence as a legitimate axis of NLP evaluation rather than an edge case.
- Source-dependence in multi-source RAG systems represents a critical evaluation gap not captured by single-answer correctness metrics.
- Institutional sources in regulated domains like medicine frequently contain genuine disagreements requiring explicit auditing mechanisms.
- HERO-QA and structured taxonomy approaches enable systematic measurement of inter-source relationships at scale.
- Better retrieval methods reveal substantially higher prevalence of source disagreement than prior estimates indicated.
- Source-dependence auditing is a domain-agnostic responsibility for all deployed multi-source NLP systems in knowledge work.