
Baseline Performance of AI Tools in Classifying Cognitive Demand of Mathematical Tasks

arXiv – CS AI | Danielle S. Fox, Brenda L. Robles, Elizabeth DiPietro Brovey, Christian D. Schunn

AI Summary

A research study tested 11 AI tools on their ability to classify the cognitive demand of mathematical tasks, finding they achieved only 63% accuracy on average, with no tool exceeding 83%. The tools showed a systematic bias toward middle-category classifications and tended to rely on surface textual features rather than reasoning about the underlying cognitive processes a task demands.

Key Takeaways
  • AI tools achieved only 63% average accuracy in classifying cognitive demand of mathematical tasks, with no tool exceeding 83%.
  • Education-specific AI tools performed no better than general-purpose tools like ChatGPT and Claude.
  • All tools exhibited systematic bias toward middle-category levels and struggled with extreme cognitive demand classifications.
  • AI tools overweighted surface textual features rather than understanding underlying cognitive processes.
  • The findings highlight significant limitations in current AI tools for educational applications and teacher workflow integration.
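The middle-category bias described above can be quantified with a simple check: compare the share of middle-level predictions against the share of middle-level gold labels, alongside overall accuracy. A minimal sketch in Python — the four-level label scheme and the toy data here are illustrative assumptions, not the study's actual rubric or results:

```python
# Hypothetical 4-level cognitive-demand scale (labels are illustrative,
# not necessarily the scheme used in the study).
LEVELS = ["memorization", "procedures_no_conn", "procedures_conn", "doing_math"]
MIDDLE = {"procedures_no_conn", "procedures_conn"}

def evaluate(true_labels, pred_labels):
    """Return overall accuracy and the fraction of predictions landing in
    the two middle categories, to expose central-tendency bias."""
    n = len(true_labels)
    acc = sum(t == p for t, p in zip(true_labels, pred_labels)) / n
    mid_rate = sum(p in MIDDLE for p in pred_labels) / n
    return acc, mid_rate

# Toy data: a tool that over-predicts middle levels (illustrative only).
true_labels = ["memorization", "doing_math", "procedures_conn",
               "memorization", "doing_math"]
pred_labels = ["procedures_no_conn", "procedures_conn", "procedures_conn",
               "memorization", "procedures_conn"]

acc, mid_rate = evaluate(true_labels, pred_labels)
true_mid_rate = sum(t in MIDDLE for t in true_labels) / len(true_labels)
print(f"accuracy={acc:.2f}, predicted middle rate={mid_rate:.2f}, "
      f"true middle rate={true_mid_rate:.2f}")
```

A gap between the predicted and true middle-category rates (here 0.80 vs. 0.20 on the toy data) is the kind of signal that would reveal the systematic middle-category bias the study reports.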
Mentioned in AI
Companies: Perplexity
Models: ChatGPT (OpenAI), Claude (Anthropic), Gemini (Google), Grok (xAI)