Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth
Researchers demonstrate that deep literature search pipelines dramatically improve retrieval performance (from ~20% to 80% recall) compared to basic API searches, while simultaneously revealing that human citation lists contain significant bias and are unsuitable as ground truth for evaluation. The study advocates for multi-dimensional evaluation metrics beyond simple recall to assess citation quality accurately.
This research addresses a fundamental challenge in academic information retrieval: how to evaluate whether literature search systems actually find relevant papers. The team implemented a Deep Research pipeline that processes full query papers and expands searches breadth-first through bibliographies, achieving substantial performance gains on the RollingEval benchmark.
However, their more provocative finding challenges the evaluation paradigm itself: human reference lists, traditionally treated as gold-standard ground truth, show significant limitations. Using LLM-as-a-judge evaluation, the researchers found that only 51% of human citations meet a moderate relevance threshold, while top AI re-rankers achieve 86-88% relevance rates. This gap points to systematic bias in human citation behavior: researchers cite direct collaborators 2.5 times more frequently than AI systems would recommend based on topical relevance alone.
The implications extend beyond academic publishing to AI system evaluation more broadly. Benchmarking methodologies across machine learning commonly assume that human annotations represent objective truth, yet this work indicates that human judgment carries predictable social and professional biases. The research argues for replacing single-metric evaluation with complementary measures including recall, topical relevance scoring, ranked-list diversity, and co-authorship-distance diagnostics. This shift could improve how the AI research community validates search and retrieval systems. The findings suggest that AI-driven re-ranking may identify more relevant papers than human selection, challenging the assumption that human expertise is the ultimate arbiter of quality.
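The breadth-first bibliography expansion mentioned above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `expand_bibliographies`, the `get_references` callback, the `max_depth` and `max_papers` limits, and the toy citation graph are all hypothetical, chosen only to show the traversal order.

```python
from collections import deque

def expand_bibliographies(seed_ids, get_references, max_depth=2, max_papers=1000):
    """Collect candidate papers by walking reference lists breadth-first.

    seed_ids: paper IDs returned by an initial search (hypothetical input).
    get_references: callable mapping a paper ID to the IDs it cites
        (assumed to wrap whatever bibliography source is available).
    """
    seeds = list(dict.fromkeys(seed_ids))  # drop duplicate seeds, keep order
    seen = set(seeds)
    candidates = list(seeds)
    queue = deque((paper_id, 0) for paper_id in seeds)

    while queue and len(candidates) < max_papers:
        paper_id, depth = queue.popleft()
        if depth >= max_depth:
            continue  # do not expand papers at the depth limit
        for ref_id in get_references(paper_id):
            if ref_id in seen:
                continue
            seen.add(ref_id)
            candidates.append(ref_id)
            queue.append((ref_id, depth + 1))
            if len(candidates) >= max_papers:
                break
    return candidates

# Toy citation graph standing in for a real bibliography lookup.
TOY_GRAPH = {"A": ["B", "C"], "B": ["D"], "C": ["D", "E"], "D": ["F"]}

def toy_references(paper_id):
    return TOY_GRAPH.get(paper_id, [])

print(expand_bibliographies(["A"], toy_references, max_depth=2))
```

Because the queue is first-in, first-out, every paper one hop from the seeds is collected before any paper two hops away, which is what distinguishes breadth-first expansion from following a single citation chain to its end.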
- Deep literature search pipelines achieve 80% recall versus roughly 20% for basic API searches through breadth-first bibliography expansion
- Only 51% of human citations meet moderate relevance standards, compared to 86-88% for top AI re-rankers
- Humans are 2.5x more likely to cite direct collaborators than topical relevance alone would predict, revealing social bias in citation behavior
- Single-metric evaluation is insufficient; multi-dimensional assessment including diversity and co-authorship distance is essential
- AI-based relevance judgments may exceed human citation quality when social factors are removed
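
Three of the complementary measures named above (recall, a thresholded relevance rate, and co-authorship distance) can be sketched as follows. The definitions here are assumptions made for illustration, not the paper's formulas: the relevance scale, the threshold value, and the reading of co-authorship distance as shortest-path hops in a co-author graph are all hypothetical, as are the function names.

```python
from collections import deque

def recall(retrieved, cited):
    """Fraction of the human-cited papers that the system retrieved."""
    cited = set(cited)
    if not cited:
        return 0.0
    return len(cited & set(retrieved)) / len(cited)

def relevance_rate(scores, threshold):
    """Fraction of papers whose judged relevance meets the threshold.

    scores: one relevance score per paper, e.g. from an LLM judge
        (the scale is an assumption; the source does not specify it).
    """
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)

def coauthor_distance(coauthors, source, target):
    """Shortest number of co-authorship hops between two authors.

    coauthors: dict mapping each author to the set of their co-authors.
    Returns None when no path exists.
    """
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        author, hops = queue.popleft()
        for neighbor in coauthors.get(author, ()):
            if neighbor == target:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None

# Toy data, invented purely for illustration.
graph = {"ann": {"bob"}, "bob": {"ann", "cy"}, "cy": {"bob"}}
print(recall(["p1", "p2", "p3"], ["p2", "p4"]))
print(relevance_rate([3, 1, 2, 0], threshold=2))
print(coauthor_distance(graph, "ann", "cy"))
```

Reporting these side by side makes the article's point concrete: a retrieved list can score low on recall against a human reference list while still scoring high on judged relevance, and a distance of 1 flags a cited author as a direct collaborator.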