AINeutralarXiv โ CS AI ยท 14h ago7/10
๐ง
PaperScope: A Multi-Modal Multi-Document Benchmark for Agentic Deep Research Across Massive Scientific Papers
Researchers introduce PaperScope, a comprehensive benchmark for evaluating multi-modal AI systems on complex scientific research tasks across multiple documents. The benchmark reveals that even advanced systems like OpenAI Deep Research and Tongyi Deep Research struggle with long-context retrieval and cross-document reasoning, exposing significant gaps in current AI capabilities for scientific workflows.
๐ข OpenAI