AINeutralarXiv – CS AI · 14h ago6/10
🧠
Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth
Researchers demonstrate that deep literature search pipelines dramatically improve retrieval performance (from ~20% to 80% recall) compared to basic API searches, while simultaneously revealing that human citation lists contain significant bias and are unsuitable as ground truth for evaluation. The study advocates for multi-dimensional evaluation metrics beyond simple recall to assess citation quality accurately.