AgriGov: A Structured Multilingual Dataset Curation for Indian Government Schemes for Farmers
AgriGov is a curated trilingual (English-Hindi-Marathi) dataset of 8,000 parallel sentence pairs covering Indian agricultural government schemes and farmer welfare programs. It combines automated data collection, machine translation, and human post-editing to produce domain-specific resources for machine translation, question answering, and information retrieval in farmer-facing applications.
AgriGov addresses a gap in multilingual NLP resources: agricultural policy documentation in the languages India's farming population actually uses. Its three-stage translation pipeline pairs automated tools with human validation, which matters most for domain-specific terminology, where a mistranslation could have material consequences for farmers trying to access government benefits.
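The pipeline can be pictured with a minimal sketch. The key points of this summary name Google Translate, MarianMT, and manual post-editing as the three stages, but not how they are ordered or combined. The version below assumes both machine translation systems produce candidates and a human reviewer selects or corrects the final text. All names (`curate`, `CuratedTranslation`, the callable signatures) are hypothetical, and the translators are passed in as plain functions so that no real API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Assumed interfaces, not the authors' actual tooling:
# a translator maps (text, target language code) to a translation,
# and a post-editor maps (source, machine candidates) to the approved text.
Translator = Callable[[str, str], str]
PostEditor = Callable[[str, Dict[str, str]], str]


@dataclass
class CuratedTranslation:
    source: str
    target_lang: str
    candidates: Dict[str, str]  # system name -> raw machine output
    final: str                  # text approved by the human reviewer
    edited: bool                # True if the final text matches no raw candidate


def curate(
    source: str,
    target_lang: str,
    systems: Dict[str, Translator],
    post_edit: PostEditor,
) -> CuratedTranslation:
    """Run every MT system on the source, then hand the candidates to a human post-editor."""
    # Stages 1 and 2: collect one candidate per machine translation system.
    candidates = {name: translate(source, target_lang) for name, translate in systems.items()}
    # Stage 3: the reviewer picks a candidate or writes a corrected version.
    final = post_edit(source, candidates)
    return CuratedTranslation(
        source=source,
        target_lang=target_lang,
        candidates=candidates,
        final=final,
        edited=final not in candidates.values(),
    )
```

Keeping the raw candidates next to the approved text makes it possible to measure how often human correction was needed, which is one plausible way to audit terminology errors per system.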
The project reflects a broader trend in AI development toward underserved languages and sectors. English-language NLP is mature, but farmers in India predominantly speak regional languages, which creates a bottleneck between government digital services and their intended beneficiaries. By structuring data around semantic fields such as eligibility criteria and application processes, AgriGov enables downstream applications that could meaningfully reduce information asymmetry.
For developers and organizations building farmer-facing fintech or agricultural platforms, this dataset provides a foundation for deploying accurate translation and question-answering systems without starting from zero. The inclusion of 50 government schemes with verified provenance means applications built on AgriGov can cite authoritative sources, reducing legal and compliance risks in a sector where misinformation about entitlements has documented harms.
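As an illustration of how an application might use schema-structured entries with provenance, here is a minimal sketch. The record layout is an assumption: the field names (`scheme`, `semantic_field`, `text`, `source_url`), the language codes, and the `lookup` helper are hypothetical and are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class SchemeSentence:
    scheme: str            # scheme name (hypothetical field)
    semantic_field: str    # e.g. "eligibility" or "application_process"
    text: Dict[str, str]   # language code -> sentence; assumed keys "en", "hi", "mr"
    source_url: str        # where the English source text came from


def lookup(
    records: List[SchemeSentence],
    scheme: str,
    semantic_field: str,
    lang: str,
) -> List[Tuple[str, str]]:
    """Return (sentence, source_url) pairs for one scheme and field in one language."""
    return [
        (r.text[lang], r.source_url)
        for r in records
        if r.scheme == scheme
        and r.semantic_field == semantic_field
        and lang in r.text  # skip records missing this language
    ]
```

Returning the source URL with every sentence lets a question-answering front end show the farmer where a statement about eligibility or application steps originated.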
The real-world impact depends on adoption rates among developers and whether the dataset becomes a standard reference in agricultural AI applications. Success requires integration into popular ML frameworks and continued expansion beyond the initial schemes. The schema-driven approach also enables future scaling to other Indian languages and potentially other nations with similar agricultural policy documentation challenges.
- AgriGov provides 8,000 parallel sentence pairs in English-Hindi-Marathi specifically for agricultural government schemes, filling a gap in multilingual NLP resources for underserved languages.
- The dataset uses a human-corrected translation pipeline combining Google Translate, MarianMT, and manual post-editing to ensure domain-specific accuracy in agricultural terminology.
- Applications include machine translation, question-answering, and information retrieval systems designed to help farmers access government welfare schemes in native languages.
- The structured schema-driven approach with verified provenance enables reproducible experiments and reduces compliance risks for farmer-facing fintech and agricultural platforms.
- Success depends on developer adoption and integration into ML frameworks; the schema-driven approach also leaves room to expand to other Indian languages and to other nations with similar policy documentation challenges.