ChemQuests: A Curated Chemistry Question-Answer Database Extracted from ChemRxiv Papers
ChemQuests is a new curated dataset containing 952 question-answer pairs extracted from chemistry research papers, designed to advance chemistry-focused natural language processing. The dataset bridges the gap between rapidly expanding chemistry literature and the need for domain-specific training data for AI models and retrieval systems.
ChemQuests addresses a genuine bottleneck in specialized NLP development. As chemistry literature grows exponentially, researchers lack accessible, curated datasets that combine high-quality question-answer pairs with source attribution. This dataset represents a practical response to that gap, offering 952 QA pairs spanning 17 chemistry subfields with explicit traceability to original text segments.
The construction methodology demonstrates pragmatic AI-assisted curation rather than pure automation. The pipeline combines OCR for text extraction, GPT-4o for QA generation, and fuzzy-search verification—a hybrid approach that leverages large language models while maintaining quality control through verification steps. This balanced methodology could serve as a model for creating domain-specific datasets in other technical fields lacking curated training resources.
For the AI and NLP community, ChemQuests enables several concrete applications: fine-tuning domain-adapted language models for chemistry, building more accurate chemistry-specific retrieval systems, and supporting chemistry education tools. The emphasis on conceptual, mechanistic, and experimental questions suggests the dataset captures diverse reasoning types rather than mere factual lookup. However, the dataset's scope remains limited: 952 pairs drawn from 155 papers represent a foundation rather than comprehensive coverage.
The authors acknowledge limitations and outline expansion plans, indicating awareness that domain-specific datasets require iterative development and expert validation. This release is likely to encourage further chemistry-NLP research and may inspire similar dataset construction in other scientific disciplines struggling with literature overload.
- ChemQuests provides 952 curated question-answer pairs from chemistry research with explicit source traceability for NLP applications.
- The dataset construction combines automated tools (OCR, GPT-4o) with verification methods, balancing scale with quality control.
- Coverage spans 17 chemistry subfields with emphasis on conceptual, mechanistic, and experimental questions relevant to domain understanding.
- Primary applications include training domain-adapted language models, building chemistry-specific search systems, and chemistry education support.
- The limited initial dataset (155 papers) serves as a foundation requiring future expansion and expert validation for broader utility.