Assessing the Pedagogical Readiness of Large Language Models as AI Tutors in Low-Resource Contexts: A Case Study of Nepal's K-10 Curriculum
A comprehensive study evaluates four state-of-the-art LLMs (GPT-4o, Claude Sonnet 4, Qwen3-235B, Kimi K2) for use as AI tutors in Nepal's K-10 curriculum, revealing significant pedagogical gaps despite high technical accuracy. The research identifies critical failure modes including inability to simplify complex concepts for young learners and poor cultural contextualization, concluding that current LLMs require human oversight and curriculum-specific fine-tuning before classroom deployment in low-resource regions.
This research addresses a consequential gap in AI development: while LLMs demonstrate strong performance on standardized benchmarks in Western contexts, their readiness for real-world educational deployment in non-Western, resource-constrained regions remains largely untested. The study's systematic evaluation across Nepal's K-10 curriculum reveals that technical competence does not translate directly into pedagogical effectiveness. The identified "Expert's Curse"—where models excel at problem-solving but struggle to explain concepts accessibly—represents a fundamental mismatch between AI capabilities and educational requirements. This distinction matters because it exposes how optimization for raw accuracy masks failures in practical utility.
The research emerges from growing pressure to leverage AI for educational equity, particularly in underserved regions where qualified tutors are scarce. However, the findings suggest that deploying frontier models as autonomous tutors risks perpetuating educational inequity rather than resolving it. Regional models like Kimi K2 showed even sharper limitations, particularly in cultural contextualization, highlighting that geographic proximity does not ensure contextual relevance.
For the AI development industry, this study underscores the necessity of curriculum-aligned benchmarking and multi-dimensional evaluation frameworks that go beyond aggregate accuracy metrics. The proposed "human-in-the-loop" strategy and fine-tuning blueprint offer pathways forward, but they require investment in localized adaptation, a labor-intensive process that challenges the scalability assumptions underlying AI-for-education initiatives. For educators and policymakers in low-resource contexts, the message is clear: current off-the-shelf solutions require substantial modification before deployment. The research establishes a methodological template that other regions can replicate, potentially catalyzing more context-aware AI development.
- Frontier LLMs achieve 97% aggregate reliability but fail significantly in pedagogical clarity and cultural adaptation for non-Western educational contexts.
- The "Expert's Curse" and "Foundational Fallacy" reveal that technical accuracy does not ensure effective teaching, especially for younger learners and lower-grade material.
- Regional AI models exhibit worse performance in cultural contextualization than global frontier models, undermining assumptions about geographic proximity benefits.
- Human-in-the-loop deployment and curriculum-specific fine-tuning are necessary prerequisites before LLM-based tutoring in low-resource classrooms.
- The research establishes a replicable evaluation framework combining curriculum alignment and pedagogical metrics that other regions can apply to assess AI readiness.
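To make the "beyond aggregate accuracy" point concrete, here is a minimal sketch of a multi-dimensional scoring rubric. The dimension names, weights, and numbers are illustrative assumptions, not the study's actual rubric or data; the point is only that a single aggregate accuracy number can look strong while per-dimension pedagogical scores are failing.

```python
from dataclasses import dataclass

# Hypothetical rubric: each tutoring response is scored on several
# dimensions, not just correctness. Field names are illustrative,
# not taken from the study.
@dataclass
class ResponseScores:
    accuracy: float       # was the answer technically correct? (0-1)
    clarity: float        # was it explained at the learner's grade level? (0-1)
    cultural_fit: float   # did examples fit the local context? (0-1)

def aggregate_accuracy(scores: list[ResponseScores]) -> float:
    """Single-number metric: mean technical correctness only."""
    return sum(s.accuracy for s in scores) / len(scores)

def per_dimension_means(scores: list[ResponseScores]) -> dict[str, float]:
    """Multi-dimensional report that surfaces what aggregate accuracy hides."""
    n = len(scores)
    return {
        "accuracy": sum(s.accuracy for s in scores) / n,
        "clarity": sum(s.clarity for s in scores) / n,
        "cultural_fit": sum(s.cultural_fit for s in scores) / n,
    }

# Invented example: a model that solves problems correctly but explains
# them poorly and with contextually irrelevant examples.
scores = [
    ResponseScores(accuracy=1.0, clarity=0.4, cultural_fit=0.3),
    ResponseScores(accuracy=1.0, clarity=0.5, cultural_fit=0.2),
    ResponseScores(accuracy=0.9, clarity=0.3, cultural_fit=0.4),
]
print(aggregate_accuracy(scores))    # high, near 0.97
print(per_dimension_means(scores))   # clarity and cultural_fit are low
```

Reporting the per-dimension means alongside the headline accuracy is one simple way a replicating study can expose an "Expert's Curse" pattern: correctness saturates while explanatory quality does not.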