y0news
🧠 AI | Neutral | Importance 6/10

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

arXiv – CS AI | Gaurav Sahu, Laurent Charlin, Christopher Pal
🤖 AI Summary

Researchers show that a deep literature search pipeline raises retrieval recall from roughly 20% to 80% compared with basic API searches. They also find that human citation lists contain significant bias and are unsuitable as ground truth for evaluation. The study advocates multi-dimensional evaluation metrics, beyond simple recall, to assess citation quality accurately.

Analysis

This research addresses a fundamental challenge in academic information retrieval: how to evaluate whether literature search systems actually find relevant papers. The team implemented a Deep Research pipeline that processes the full query paper and expands the search breadth-first through bibliographies, achieving substantial recall gains on the RollingEval benchmark.

Their more provocative finding challenges the evaluation paradigm itself: human reference lists, traditionally treated as gold-standard ground truth, show significant limitations. Using LLM-as-a-judge evaluation, the researchers found that only 51% of human citations meet a moderate relevance threshold, while the top AI re-rankers reach relevance rates of 86-88%. The gap points to systematic bias in human citation behavior: researchers cite direct collaborators 2.5 times more frequently than AI systems would recommend based on topical relevance alone.

The implications extend beyond academic publishing to AI system evaluation more broadly. Benchmarking methodologies in machine learning often assume that human annotations represent objective truth, yet this work shows that human judgment contains predictable social and professional biases. The research argues for replacing single-metric evaluation with complementary measures: recall, topical relevance scoring, ranked-list diversity, and co-authorship-distance diagnostics. Such a shift could improve how the AI research community validates search and retrieval systems. The findings also suggest that AI-driven re-ranking may identify papers that are more topically relevant than those humans select, challenging the assumption that human expertise is the ultimate arbiter of quality.
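To make the breadth-first bibliography expansion and the recall metric concrete, here is a minimal sketch in Python. It is not the authors' pipeline: the function names, the `max_depth` parameter, and the toy citation graph are all assumptions made for illustration, and the numbers it produces are properties of the toy data, not results from the paper.

```python
from collections import deque


def expand_bibliography(seeds, references, max_depth=2):
    """Breadth-first expansion over a citation graph (illustrative only).

    seeds:      paper ids returned by an initial search (hypothetical)
    references: dict mapping a paper id to the ids in its bibliography
    max_depth:  how many bibliography hops to follow from the seeds
    """
    found = set(seeds)
    queue = deque((pid, 0) for pid in found)
    while queue:
        pid, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand papers at the depth limit
        for cited in references.get(pid, []):
            if cited not in found:
                found.add(cited)
                queue.append((cited, depth + 1))
    return found


def recall(retrieved, gold):
    """Fraction of the gold reference list present in the retrieved set."""
    gold = set(gold)
    if not gold:
        return 0.0
    return len(gold & set(retrieved)) / len(gold)


# Toy data (hypothetical): "s1" is the only hit from the initial search.
references = {"s1": ["a", "b"], "a": ["c"], "b": ["c", "d"], "c": ["e"]}
gold = ["a", "c", "d", "e", "z"]

for depth in range(4):
    retrieved = expand_bibliography(["s1"], references, max_depth=depth)
    print(depth, round(recall(retrieved, gold), 2))
```

In this toy graph, recall rises as the expansion goes deeper, but paper "z" is never reached because no bibliography links to it. That illustrates why recall against a reference list is only one of the complementary measures the study recommends.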

Key Takeaways
  • Deep literature search pipelines achieve 80% recall versus 20% for basic API searches through breadth-first bibliography expansion
  • Only 51% of human citations meet moderate relevance standards, compared to 86-88% for top AI re-rankers
  • Humans cite direct collaborators 2.5 times more often than AI systems would recommend on topical relevance alone, revealing social bias in citation behavior
  • Single-metric evaluation is insufficient; multi-dimensional assessment including diversity and co-authorship distance is essential
  • AI-based relevance judgments may exceed human citation quality when social factors are removed
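
A co-authorship-distance diagnostic can be sketched as a shortest-path search over a co-author graph. The summary does not say how the paper defines or computes this distance, so the definition below (minimum number of co-authorship hops between the query paper's authors and a cited paper's authors) and every name and data value in it are assumptions for illustration.

```python
from collections import deque


def coauthor_distance(query_authors, cited_authors, coauthors):
    """Minimum co-authorship hops between two author sets.

    Returns 0 if the sets share an author, 1 if any pair has written
    together, and None if no path exists in the graph.
    coauthors: dict mapping an author to the set of their co-authors.
    """
    targets = set(cited_authors)
    seen = set(query_authors)
    queue = deque((author, 0) for author in seen)
    while queue:
        author, dist = queue.popleft()
        if author in targets:
            return dist  # BFS order guarantees this is the minimum
        for nxt in coauthors.get(author, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def direct_collaborator_share(query_authors, cited_papers, coauthors):
    """Share of cited papers written by the authors or their direct co-authors."""
    if not cited_papers:
        return 0.0
    close = 0
    for authors in cited_papers:
        dist = coauthor_distance(query_authors, authors, coauthors)
        if dist is not None and dist <= 1:
            close += 1
    return close / len(cited_papers)


# Toy co-author graph (hypothetical names).
coauthors = {
    "alice": {"bob"},
    "bob": {"alice", "carol"},
    "carol": {"bob"},
    "dave": set(),
}
human_list = [["bob"], ["alice", "erin"], ["carol"], ["dave"]]
print(direct_collaborator_share(["alice"], human_list, coauthors))
```

Comparing this share between a human reference list and an AI-ranked list for the same query paper is one way to surface the kind of collaborator bias described above.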