Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025
A comprehensive audit of 1,603 NLP papers published between 2018 and 2025 reveals that while researchers increasingly report operational annotation details like recruitment and expertise, critical information for assessing data validity (such as annotator training, language proficiency, compensation, and inter-annotator agreement) is still frequently omitted. The study establishes a scalable framework and reporting taxonomy to improve reproducibility and reliability in NLP research.
This large-scale assessment addresses a fundamental credibility gap in NLP research by systematically documenting which annotation practices papers actually report versus what would be needed for rigorous evaluation. The researchers analyzed 2,667 annotation tasks across major venues using an LLM-assisted pipeline validated against human judgments; that validation suggests machine extraction can approach human reliability in identifying reporting patterns. Their findings reveal an asymmetry in disclosure: papers tend to document logistical details like annotator recruitment but frequently omit validity-critical information like training protocols, agreement metrics, and demographic data.
This transparency gap matters because annotation quality directly determines dataset reliability, which cascades through all downstream NLP systems. When papers fail to report inter-annotator agreement or adjudication procedures, readers cannot assess whether observed model performance reflects genuine capability or artifacts of low-quality labels. The audit shows this problem is particularly acute in model-evaluation studies, where annotation quality directly validates claimed improvements.
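To make the role of inter-annotator agreement concrete, here is a minimal sketch of Cohen's kappa, one common agreement statistic for two annotators. The summary does not say which agreement metrics the audited papers use, so the choice of kappa, the function name `cohen_kappa`, and the toy labels are all assumptions for illustration only.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label counts.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention of this sketch.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Toy data (hypothetical): two annotators, six items, two classes.
annotator_1 = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_2 = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on four of six items, yet half of that agreement is expected by chance, so kappa is far lower than the raw agreement rate. A paper that reports neither figure leaves readers unable to make this distinction.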
The industry impact extends beyond academic reproducibility. As enterprises deploy NLP systems, understanding annotation provenance becomes critical for liability and performance prediction. Models trained on poorly documented or inadequately validated datasets introduce opacity into commercial applications. The paper's bare-minimum reporting framework provides actionable guidance that could standardize practices across the field, potentially improving dataset quality at scale.
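As a rough illustration of how a bare-minimum reporting checklist could be applied in practice, the sketch below records the annotation details named in this summary and lists whichever ones a paper leaves unreported. The class `AnnotationReport`, its field names, and the helper `missing_items` are hypothetical; they are not the paper's actual taxonomy, which this summary does not reproduce.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class AnnotationReport:
    """Hypothetical per-task record of what a paper reports.

    Field names are illustrative, drawn from the items this summary
    mentions; None means the paper does not report that detail.
    """
    recruitment: Optional[str] = None
    expertise: Optional[str] = None
    training: Optional[str] = None
    language_proficiency: Optional[str] = None
    compensation: Optional[str] = None
    agreement: Optional[str] = None
    adjudication: Optional[str] = None
    demographics: Optional[str] = None


def missing_items(report):
    """Return the names of fields that are None or blank."""
    return [
        f.name
        for f in fields(report)
        if not (getattr(report, f.name) or "").strip()
    ]


# Example: a paper that documents logistics but no validity details.
report = AnnotationReport(
    recruitment="crowdsourcing platform",
    expertise="undergraduate linguistics students",
)
gaps = missing_items(report)
```

Here `gaps` contains the six unreported fields, mirroring the asymmetry the audit describes: logistical details present, validity-critical ones absent.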
Looking forward, adoption of these recommendations by conference organizers and journals could create immediate improvements, while long-term solutions may require standardized annotation documentation platforms that capture metadata automatically during dataset creation.
- NLP papers frequently omit critical annotation validity details including inter-annotator agreement, annotator training, language proficiency, and compensation information.
- Annotation reporting has improved over time but remains inconsistent across venues, topics, and use cases, with model-evaluation studies showing the worst compliance.
- An LLM-assisted extraction pipeline successfully identified annotation practices across 1,603 papers with performance approaching human-adjudicated standards.
- The study establishes a unified taxonomy and bare-minimum reporting framework to standardize annotation documentation and improve NLP research reproducibility.
- Documentation gaps directly undermine dataset reliability assessments, creating downstream risks for commercial NLP deployments and model performance claims.