Temporal Stability and Few-Shot Prompting in Math Task Assessment
A longitudinal study examined how AI models (Gemini and Coteach) perform on mathematics task classification using the Task Analysis Guide, testing stability across model versions and responsiveness to few-shot prompting. Results showed newer model versions produced mixed effects, but few-shot prompting consistently improved both models' accuracy, suggesting prompt engineering is more reliable than passive model updates for specialized educational tasks.
This research addresses a critical gap in understanding AI model reliability in educational contexts. While AI adoption in schools accelerates, educators and institutions lack empirical data on whether newer model versions actually perform better on domain-specific tasks. The study's finding that Gemini maintained 58% accuracy while Coteach declined from 75% to 50% across versions reveals an uncomfortable truth: software updates don't guarantee performance improvements, particularly for specialized applications.
The broader context reflects growing concerns about AI model degradation and the "drift" phenomenon where capabilities erode over time. Educational technology stakeholders have invested heavily in AI-assisted assessment tools, assuming each new release brings enhancements. This research challenges that assumption and highlights the importance of continuous evaluation rather than blind trust in vendor claims.
For the EdTech industry, the implications are substantial. Few-shot prompting lifted Gemini from 58% to 67% and restored Coteach from 50% to its earlier 75%, which suggests that implementation strategy matters more than passive model selection. This shifts part of the responsibility from AI vendors to educators and administrators, who must actively optimize prompts rather than expect out-of-the-box performance. Institutions may need to invest in prompt engineering expertise alongside tool procurement.
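To make the few-shot idea concrete, here is a minimal sketch of how worked examples might be prepended to a classification prompt. It is not the study's actual prompt. The label set assumes the four cognitive-demand levels commonly associated with the Task Analysis Guide, and the example tasks, the `LabeledTask` type, and the `build_few_shot_prompt` helper are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass

# Assumed label set: the four cognitive-demand levels commonly associated
# with the Task Analysis Guide. The study's exact wording may differ.
TAG_LEVELS = (
    "Memorization",
    "Procedures without connections",
    "Procedures with connections",
    "Doing mathematics",
)


@dataclass(frozen=True)
class LabeledTask:
    """A math task paired with its human-assigned level (hypothetical type)."""
    text: str
    level: str


def build_few_shot_prompt(examples, new_task):
    """Return a prompt that shows labeled examples before the task to classify."""
    for ex in examples:
        if ex.level not in TAG_LEVELS:
            raise ValueError(f"Unknown level: {ex.level!r}")

    lines = ["Classify the mathematics task into exactly one of these levels:"]
    lines += [f"- {level}" for level in TAG_LEVELS]
    lines.append("")

    # Each worked example demonstrates the expected input/output format.
    for ex in examples:
        lines += [f"Task: {ex.text}", f"Level: {ex.level}", ""]

    # The unlabeled task goes last, leaving the level for the model to fill in.
    lines += [f"Task: {new_task}", "Level:"]
    return "\n".join(lines)


# Hypothetical worked examples, invented for illustration only.
EXAMPLES = [
    LabeledTask("State the formula for the area of a rectangle.",
                "Memorization"),
    LabeledTask("Compute 3/4 + 2/5 using the standard algorithm.",
                "Procedures without connections"),
]

prompt = build_few_shot_prompt(
    EXAMPLES, "Explain, using a diagram, why 3/4 is greater than 2/3."
)
print(prompt)
```

A zero-shot prompt would contain only the instruction and the final task. The few-shot version differs only in the labeled examples placed between them, so the same model can be compared under both conditions.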
Looking forward, this research suggests the AI education market requires more rigorous, ongoing evaluation frameworks. Rather than chasing the latest model versions, organizations should establish baseline performance metrics and regularly test against domain-specific benchmarks. The findings also underscore that general-purpose models and education-specific tools have distinct performance trajectories, requiring different evaluation and implementation approaches.
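One way such a baseline check could look is sketched below. The accuracy figures are the ones reported in this summary. The function names and the 5-point tolerance are assumptions chosen for illustration, not part of the study.

```python
def accuracy(predictions, gold):
    """Fraction of predictions that match the human-assigned labels."""
    if not gold or len(predictions) != len(gold):
        raise ValueError("predictions and gold must be non-empty and equal length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def flag_regressions(baseline, current, tolerance=0.05):
    """Return tools whose accuracy fell more than `tolerance` below baseline.

    `tolerance` is a hypothetical threshold; choose one that reflects the
    benchmark size and how much run-to-run variation is expected.
    """
    return {
        name: (baseline[name], current[name])
        for name in baseline
        if name in current and baseline[name] - current[name] > tolerance
    }


# Accuracy by model version, as reported in the summary above.
older_version = {"Gemini": 0.58, "Coteach": 0.75}
newer_version = {"Gemini": 0.58, "Coteach": 0.50}

flagged = flag_regressions(older_version, newer_version)
print(flagged)  # {'Coteach': (0.75, 0.5)}
```

Rerunning the same check after each vendor release, against a fixed set of labeled tasks, turns "is the new version better?" into a measured comparison rather than an assumption.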
- Model version updates produced different effects: Gemini held steady at 58%, while Coteach declined from 75% to 50%
- Few-shot prompting improved both models, suggesting that prompt engineering is more dependable than passive updates
- Educational AI tool selection cannot rely on vendor claims alone; continuous empirical testing is essential
- Organizations must invest in prompt optimization strategies to maximize specialized AI tool performance
- General-purpose and education-specific AI models show different stability patterns across versions