MemoryDocDataSet: A Benchmark for Joint Conversational Memory and Long Document Reasoning
Researchers have introduced MemoryDocDataSet, a benchmark for evaluating AI systems that must handle multi-session conversational memory and long-document reasoning at the same time. The synthetic dataset reveals a significant performance gap in current architectures: the best baseline reaches only 35.8% F1 overall, on a benchmark where most questions require joint memory-document navigation.
MemoryDocDataSet addresses a critical blind spot in AI evaluation infrastructure. While individual capabilities—conversation understanding and document comprehension—have been extensively benchmarked, the intersection of these skills remains largely unexplored despite real-world applications frequently demanding both. This benchmark matters because it exposes fundamental architectural weaknesses in current systems attempting to scale to longer contexts and richer interaction histories.
The research emerges from growing pressure in the AI industry to handle increasingly complex reasoning tasks. Enterprise applications, legal document review systems, and conversational assistants all require models to track context across multiple sessions while extracting information from dense, lengthy documents. Although the conversations in the benchmark are synthetic, the documents are real legal texts drawn from the Caselaw Access Project, so the reading material keeps its real-world complexity instead of the simplified passages that plague many benchmarks.
The performance data is sobering. F1 falls from 0.453 on document-only retrieval to 0.267 on hybrid questions, a drop of 18.6 points (about 41% in relative terms) going by those two scores. This suggests that combining conversational memory with document navigation creates difficulties beyond those of either task alone, and that current retrieval-augmented generation approaches may be poorly suited to tasks that require switching between conversation history and document corpora.
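Taking the two scores quoted above at face value, the size of the gap can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `f1_gap` is hypothetical, and the inputs are simply the document-only and hybrid F1 values reported in this article.

```python
def f1_gap(baseline_f1: float, degraded_f1: float) -> tuple[float, float]:
    """Return (absolute drop in F1 points, relative drop in percent)."""
    absolute_points = (baseline_f1 - degraded_f1) * 100
    relative_pct = (baseline_f1 - degraded_f1) / baseline_f1 * 100
    return absolute_points, relative_pct


# Scores as quoted in the article: document-only vs. hybrid questions.
absolute_points, relative_pct = f1_gap(0.453, 0.267)
print(f"absolute drop: {absolute_points:.1f} F1 points")
print(f"relative drop: {relative_pct:.1f}%")
```

Reporting both the absolute and the relative figure makes it clearer which kind of "point drop" a headline number refers to.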
For the AI development community, this benchmark could become a standard evaluation tool, much as SQuAD did for reading comprehension. Teams building enterprise AI systems now have a concrete set of metrics to improve on, and the release of baseline implementations and the generation pipeline enables reproducible comparisons. Future research will likely focus on unified architectures that jointly optimize memory navigation and document retrieval rather than treating them as separate problems.
- MemoryDocDataSet reveals a large F1 gap between document-only (0.453) and hybrid memory-document (0.267) questions, about 18.6 points, exposing architectural limitations in current RAG systems.
- 75.1% of questions require joint conversation-document reasoning, making this the defining challenge rather than an edge case.
- The best baseline achieves only 35.8% F1 overall, indicating substantial room for architectural innovation.
- The benchmark uses real legal documents of 20,000 to 50,000 tokens sourced from the Caselaw Access Project, ensuring genuine complexity.
- The synthetic dataset includes 50 micro-worlds with temporal event graphs and multi-persona conversations, enabling rigorous controlled evaluation.
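The F1 figures above are quoted without a definition. A common choice in question-answering benchmarks such as SQuAD is token-overlap F1 between a predicted and a reference answer; the sketch below assumes that definition, with simple whitespace tokenization, and is not necessarily the scorer MemoryDocDataSet uses. The function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two answer strings (assumed metric)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up example strings, for illustration only.
score = token_f1("the court ruled in 1998", "court ruled in 1998")
print(f"{score:.3f}")
```

A benchmark-level score under this assumption would be the mean of `token_f1` over all questions, which is how a figure such as 35.8% could arise.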