Math Education Digital Shadows for facilitating learning with LLMs: Math performance, anxiety and confidence in simulated students and AIs
Researchers introduce MEDS (Math Education Digital Shadows), a dataset of 28,000 personas from 14 LLMs designed to evaluate how language models reason about mathematics and report their confidence levels. The dataset integrates math proficiency with psychological measures like anxiety and self-efficacy, revealing that LLMs exhibit human-like biases including negative attitudes and overconfidence in mathematical reasoning.
MEDS addresses a critical gap in AI evaluation by moving beyond traditional benchmarking that measures only correct answers. The dataset captures how 14 major LLM families—including Mistral, Qwen, DeepSeek, Granite, Phi, and Grok—perform across mathematical tasks while tracking psychological dimensions like anxiety and confidence. This approach acknowledges that educational AI requires more than raw accuracy; it demands understanding how models communicate uncertainty and confidence to learners.
The research reveals that LLMs exhibit distinctly human-like mathematical biases, including logical fallacies and overconfidence even when their reasoning is incorrect. These findings matter because educational AI tutors must avoid amplifying poor mathematical thinking patterns or false confidence. When students interact with AI tutors that express unjustified certainty, learning outcomes can deteriorate. The 28,000 personas with psychological metadata enable researchers to isolate family-specific behaviors and failure modes.
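The overconfidence described above can be quantified as a calibration gap: mean stated confidence minus actual accuracy over a set of persona records. The MEDS schema is not specified here, so the record fields (`confidence`, `correct`) and the sample data below are hypothetical illustrations of the idea, not the dataset's actual format.

```python
# Hedged sketch: field names "confidence" and "correct" are assumed,
# not taken from the MEDS release.
from statistics import mean


def overconfidence_gap(records):
    """Mean stated confidence minus accuracy; positive => overconfident."""
    avg_confidence = mean(r["confidence"] for r in records)
    accuracy = mean(1.0 if r["correct"] else 0.0 for r in records)
    return avg_confidence - accuracy


# Hypothetical persona records for one model family.
personas = [
    {"model_family": "A", "confidence": 0.90, "correct": False},
    {"model_family": "A", "confidence": 0.80, "correct": True},
    {"model_family": "A", "confidence": 0.95, "correct": False},
]

print(round(overconfidence_gap(personas), 3))  # 0.55: high confidence, low accuracy
```

Grouping this statistic by model family would surface the family-specific failure modes the dataset is designed to expose.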
For the AI education sector, MEDS provides accountability infrastructure. Developers of AI tutoring systems can use this dataset to identify which models demonstrate appropriate epistemic humility and which perpetuate misconceptions. Schools and edtech platforms considering LLM deployment can reference this data to make informed decisions about model selection. The integration of cognitive network science alongside proficiency metrics sets a new standard for responsible AI assessment in education.
Future work should examine how these model biases affect actual student learning outcomes and whether educational scaffolding can mitigate LLM overconfidence. Open availability of MEDS could accelerate the development of mathematically aware AI safety practices.
- MEDS dataset tracks 28,000 LLM personas across math tasks, anxiety measures, and confidence scoring rather than accuracy alone
- LLMs exhibit human-like mathematical biases including overconfidence and logical fallacies that could harm student learning
- Dataset covers 14 LLM families, revealing family-specific behavioral patterns in mathematical reasoning
- Psychological profiling of AI models sets new standards for responsible deployment in educational technology
- Resource enables safer AI tutor development by exposing model limitations beyond standard benchmarks