Sign-Language Datasets at Scale: A Comprehensive Survey on Resources, Benchmarks, and Annotation Standards
Researchers have conducted a comprehensive survey of 120 sign-language datasets across 35 languages, identifying critical gaps in annotation standards, linguistic coverage, and real-world applicability. The study introduces a standardized 24-field datasheet and open-source documentation framework to improve dataset quality and advance accessibility technologies for Deaf and Hard-of-Hearing communities.
This research addresses a significant infrastructure problem in machine learning: the lack of standardized, high-quality datasets for sign-language AI applications. While computer vision and NLP have benefited from massive, well-documented datasets, sign-language resources remain fragmented across academic institutions, with inconsistent labeling conventions and limited linguistic representation. The survey's catalog of 120 existing datasets reveals both progress and fragmentation: many resources exist, but their incompatibility hampers systematic development of sign-language recognition and translation systems.
The broader context reflects growing recognition that AI accessibility requires linguistic diversity and community-centered design. Previous sign-language AI initiatives faced obstacles including modality imbalance (video data differs fundamentally from text), inconsistent annotation granularity, and signer bias, where training data overrepresents certain regional dialects or individual signers and real-world performance degrades as a result. These technical challenges directly affect whether AI tools serve Deaf and Hard-of-Hearing (DHH) communities effectively or perpetuate existing accessibility gaps.
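One common way to keep signer bias out of an evaluation is a signer-independent split, in which no signer contributes clips to both the training and test sets. The sketch below illustrates the idea only; it is not taken from the survey, and the `(sample_id, signer_id)` pair format and the function name are assumptions made for the example.

```python
import random
from collections import defaultdict


def signer_independent_split(samples, test_fraction=0.2, seed=0):
    """Split samples so that no signer appears in both train and test.

    `samples` is a list of (sample_id, signer_id) pairs. This layout is a
    hypothetical one chosen for illustration, not a dataset's real schema.
    """
    # Group sample ids by the signer who produced them.
    by_signer = defaultdict(list)
    for sample_id, signer_id in samples:
        by_signer[signer_id].append(sample_id)

    if len(by_signer) < 2:
        raise ValueError("need at least two signers for a signer-independent split")

    # Shuffle signers (not samples) so whole signers are held out together.
    signers = sorted(by_signer)
    random.Random(seed).shuffle(signers)

    # Hold out at least one signer, but never all of them.
    n_test = min(len(signers) - 1, max(1, round(len(signers) * test_fraction)))
    test_signers = set(signers[:n_test])

    train = [s for sg in signers if sg not in test_signers for s in by_signer[sg]]
    test = [s for sg in signers if sg in test_signers for s in by_signer[sg]]
    return train, test, test_signers


# Usage: 50 clips from 5 signers, with one signer held out entirely.
samples = [(f"clip{i}", f"signer{i % 5}") for i in range(50)]
train, test, test_signers = signer_independent_split(samples)
```

Splitting by signer rather than by clip means test accuracy reflects performance on people the model has never seen, which is closer to how a deployed system would be used.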
For developers and organizations building accessibility tools, the survey offers practical guidance: the open-source datasheet standard enables reproducible benchmarking and inter-dataset comparison, reducing duplicated effort in dataset construction. The accompanying GitHub repository provides a common reference point, allowing researchers to identify data gaps and prioritize future collection efforts. Investment in sign-language technology remains niche but is growing, with applications spanning education, healthcare, and workplace accessibility. Companies that adopt standardized annotation practices early may be better positioned to build robust translation and recognition systems.
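To make the idea of a machine-checkable datasheet concrete, the following sketch checks a dataset record for missing documentation fields. The five field names are hypothetical stand-ins; the survey's actual 24 fields are not listed here, so this shows only the shape of such a check, not the published standard.

```python
# Hypothetical subset of datasheet fields, chosen for illustration only.
# These are NOT the survey's 24 fields.
REQUIRED_FIELDS = (
    "name",
    "sign_language",
    "num_signers",
    "annotation_level",
    "license",
)


def missing_fields(datasheet):
    """Return the required fields that are absent or left empty."""
    empty_values = (None, "", [])
    return [f for f in REQUIRED_FIELDS if datasheet.get(f) in empty_values]


def completeness(datasheet):
    """Fraction of required fields that are filled in (0.0 to 1.0)."""
    filled = len(REQUIRED_FIELDS) - len(missing_fields(datasheet))
    return filled / len(REQUIRED_FIELDS)


# Usage with a made-up record; "ExampleCorpus" is not a real dataset.
sheet = {
    "name": "ExampleCorpus",
    "sign_language": "ASL",
    "num_signers": 12,
    "annotation_level": "",   # left blank by the dataset authors
}
print(missing_fields(sheet))  # fields still needing documentation
print(completeness(sheet))
```

Running the same check over every cataloged dataset would give a simple, comparable measure of how well each one is documented.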
Moving forward, the field must address whether existing datasets achieve sufficient scale and diversity to train production-grade models. Standardization alone cannot solve fundamental data scarcity; large-scale collection initiatives similar to those that enabled English speech recognition breakthroughs remain necessary for sign-language parity.
- A comprehensive survey of 120 sign-language datasets across 35 languages reveals fragmentation and inconsistent annotation standards limiting AI development.
- The study identifies modality imbalance, annotation granularity, and signer bias as critical technical challenges affecting real-world sign-language AI performance.
- A new 24-field standardized datasheet and public GitHub repository enable reproducible evaluation and consistent documentation practices across future datasets.
- Current dataset limitations constrain sign-language recognition, translation, and production systems from meeting actual Deaf and Hard-of-Hearing communication needs.
- Standardization frameworks reduce duplicative research effort and help organizations identify priority areas for future large-scale data collection initiatives.