#educational-ai News & Analysis

45 articles tagged with #educational-ai. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

Beyond End-to-End Video Models: An LLM-Based Multi-Agent System for Educational Video Generation

Researchers introduce LASEV, an LLM-based multi-agent system that generates educational videos by decomposing production into specialized agents rather than relying on end-to-end video models. The system achieves a 95% cost reduction and a throughput of over one million videos daily while maintaining high quality through structured reasoning, semantic critique, and deterministic compilation.

AI · Bearish · arXiv – CS AI · Jun 2 · 7/10

Identifying High-Confidence Social Biases in LLMs for Trustworthy Conversational Tutoring Agents

Researchers evaluated large language models used in conversational tutoring systems and found they struggle to detect social biases in educational contexts while maintaining high confidence in incorrect assessments. The study reveals that LLMs are significantly more prone to biased behavior in naturalistic tutoring conversations than in controlled benchmarks, posing risks to student learning outcomes.

AI · Bearish · arXiv – CS AI · May 27 · 7/10

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

Researchers introduced LiveK12Bench, a dynamic benchmark for evaluating Large Multimodal Models on realistic high school examinations across multiple disciplines. The study reveals that advanced LMMs like GPT-4 experience significant performance degradation when subjected to exam-realistic constraints, dropping from 79 to 53 points when process rigor and efficiency are jointly evaluated, exposing critical gaps between theoretical capabilities and practical educational readiness.

🧠 GPT-5

AI · Bearish · arXiv – CS AI · May 11 · 7/10

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

Research reveals that AI models, particularly few-shot large language models, struggle significantly with mid-range quality responses in automated short answer scoring, while fine-tuned models and human experts maintain consistent performance across all quality levels. This degradation raises fairness concerns for students with developing understanding, emphasizing the need for quality-conditioned evaluation metrics.

🧠 GPT-4🧠 GPT-5🧠 Claude

AI · Bearish · arXiv – CS AI · Apr 14 · 7/10

Edu-MMBias: A Three-Tier Multimodal Benchmark for Auditing Social Bias in Vision-Language Models under Educational Contexts

Researchers present Edu-MMBias, a comprehensive framework for detecting social biases in Vision-Language Models used in educational settings. The study reveals that VLMs exhibit compensatory class bias while harboring persistent health and racial stereotypes, and critically, that visual inputs bypass text-based safety mechanisms to trigger hidden biases.

AI · Bearish · arXiv – CS AI · Mar 5 · 6/10

Baseline Performance of AI Tools in Classifying Cognitive Demand of Mathematical Tasks

A research study tested 11 AI tools on their ability to classify the cognitive demand of mathematical tasks, finding they achieved only 63% accuracy on average with no tool exceeding 83%. The tools showed systematic bias toward middle-category classifications and struggled with reasoning about underlying cognitive processes versus surface textual features.

🏢 Perplexity🧠 ChatGPT🧠 Claude

AI · Neutral · arXiv – CS AI · Mar 4 · 7/10 · 2

Faster, Cheaper, More Accurate: Specialised Knowledge Tracing Models Outperform LLMs

Research comparing Knowledge Tracing (KT) models to Large Language Models (LLMs) for predicting student responses found that specialized KT models significantly outperform LLMs in accuracy, speed, and cost-effectiveness. The study demonstrates that domain-specific models are superior to general-purpose LLMs for educational prediction tasks, with LLMs being orders of magnitude slower and more expensive to deploy.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

The Correct Answer Trap: Pedagogically-Grounded Detection and Feedback for Hidden Misconceptions

Researchers demonstrate that automated educational feedback systems fail to detect hidden misconceptions when students arrive at correct answers through flawed reasoning, with fine-tuned classifiers achieving only 57% detection accuracy. A reasoning model reaches 84% accuracy but generates excessive false positives, prompting the proposal of a detect-verify-escalate pipeline that routes uncertain cases to diagnostic questions rather than immediate teacher escalation.

AI · Neutral · arXiv – CS AI · Jun 19 · 5/10

Confidence-Aware Automated Assessment of Student-Drawn Scientific Models

Researchers developed an automated Vision Transformer-based system to score student-drawn scientific models, addressing the costly manual assessment burden in science education. The confidence-aware framework selectively automates scoring of high-confidence submissions while deferring uncertain cases to human reviewers, demonstrating improved reliability across NGSS-aligned assessments.
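The selective automation described in this summary can be illustrated with a minimal sketch. It assumes the scorer emits a per-class probability vector for each drawing and that the top probability serves as the confidence signal. The function name `route_by_confidence`, the 0.9 threshold, and the sample probabilities are all hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def route_by_confidence(
    probs: List[List[float]], threshold: float = 0.9
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Split submissions into auto-scored and human-deferred sets.

    probs[i] is an assumed per-class probability vector for submission i.
    The threshold is an illustrative value, not one reported in the paper.
    """
    auto_scored = []  # (submission index, predicted score level)
    deferred = []     # submission indices sent to human reviewers
    for i, p in enumerate(probs):
        confidence = max(p)
        if confidence >= threshold:
            # Confident enough: accept the most probable score level.
            auto_scored.append((i, p.index(confidence)))
        else:
            # Too uncertain: leave the decision to a human reviewer.
            deferred.append(i)
    return auto_scored, deferred


# Made-up probabilities for three student drawings.
probs = [[0.05, 0.92, 0.03], [0.40, 0.35, 0.25], [0.01, 0.01, 0.98]]
auto_scored, deferred = route_by_confidence(probs, threshold=0.9)
print(auto_scored)  # [(0, 1), (2, 2)]
print(deferred)     # [1]
```

Raising the threshold sends more work to reviewers but lowers the chance of an automated mis-score. That is the trade-off a confidence-aware design has to tune.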

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

Large-scale semantic mapping of learner agency and autonomy reveals what measurement and generative AI research overlook

Researchers conducted a large-scale semantic analysis of 8,954 definitions and 2,700 scale items across 14,000+ publications to map how learner agency and autonomy are conceptualized and measured. They identified three core dimensions (task regulation, intrinsic motivation, and sociocultural action) and found that existing measurement scales systematically underrepresent the sociocultural aspect, while current generative AI applications in education narrowly focus on learning control.

AI · Bearish · arXiv – CS AI · Jun 10 · 6/10

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

Researchers introduce RealMath-Eval, a benchmark revealing that state-of-the-art LLM judges fail to accurately evaluate authentic student mathematical reasoning, performing significantly worse on real exam responses (MSE ~2.96) than on synthetic LLM-generated solutions (MSE ~1.17). The study identifies an "Evaluation Gap" stemming from human errors occupying a more diverse semantic space than the predictable patterns found in synthetic errors.
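For readers unfamiliar with the metric, mean squared error can be computed as in the sketch below. It assumes MSE here is taken between judge-assigned and human-assigned scores for the same responses. The summary does not state the scoring scale, so the scores below are invented purely to show the arithmetic.

```python
from typing import Sequence


def mean_squared_error(human: Sequence[float], judge: Sequence[float]) -> float:
    """Average squared difference between human and judge scores."""
    if len(human) != len(judge) or not human:
        raise ValueError("score lists must be non-empty and of equal length")
    return sum((h - j) ** 2 for h, j in zip(human, judge)) / len(human)


# Hypothetical scores for four responses (not data from the benchmark).
human_scores = [4, 2, 5, 0]
judge_scores = [5, 2, 3, 1]
mse = mean_squared_error(human_scores, judge_scores)
print(mse)  # (1 + 0 + 4 + 1) / 4 = 1.5
```

Because errors are squared, a judge that is occasionally far off is penalized more than one that is slightly off on every response.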

AI · Bearish · arXiv – CS AI · Jun 9 · 6/10

Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses

Researchers evaluated how large language models (GPT and Grok) perform at grading graduate-level research reports, finding significant inconsistencies both within individual models and between different models. The study reveals that interaction history causes models to systematically drift from human grading standards, raising concerns about fairness in automated academic assessment.

🧠 Grok

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

A Comparative Study of Student Perspectives on Technical Writing Feedback Quality: Evaluating LLMs, SLMs, and Humans in Computer Science Topics

A research study compares feedback quality from locally-hosted small language models (SLMs), commercial LLMs like GPT-4, and human instructors across computer science courses. The findings show that quantized Llama-3.1 matched commercial LLM performance while offering privacy and cost advantages, though human feedback remained superior for specialized writing tasks.

🧠 GPT-4🧠 Llama

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

A Finetuned SpeechLLM for Joint Multi-Granular L2 Assessment and Natural-Language Rationales

Researchers propose a fine-tuned speech language model that provides both multi-level L2 English proficiency assessment and natural-language explanations for its predictions. The model demonstrates competitive performance on standard benchmarks while offering improved interpretability, though generated rationales show lower reliability at granular word-level assessments.

AI · Neutral · arXiv – CS AI · Jun 4 · 5/10

From Motion Signals to Insights: A Unified Framework for Student Behavior Analysis and Feedback in Physical Education Classes

Researchers propose an AI framework combining motion signal analysis with large language models to analyze student behavior in outdoor physical education classes. The system generates automated pedagogical insights and teaching recommendations, addressing limitations of video-based methods that struggle with diverse outdoor settings and specialized technical movements.

AI · Neutral · arXiv – CS AI · Jun 1 · 6/10

KnowledgeGain: Evaluating and Optimizing Science News Generation for Reader Learning

Researchers introduce KnowledgeGain, a metric that evaluates science news quality by measuring reader learning rather than semantic similarity. Validated through human studies, the metric uses an LLM reader simulator to identify articles that improve post-reading comprehension and knowledge retention aligned with Bloom's Taxonomy.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

AgentSchool: An LLM-Powered Multi-Agent Simulation for Education

Researchers introduce AgentSchool, an LLM-powered multi-agent simulator that models student learning through state transitions rather than simple role-play, featuring cognitively growable student agents with knowledge graphs and adaptive teachers operating within the Zone of Proximal Development. The system addresses the challenge of validating educational AI interventions in real classrooms by creating a configurable simulation environment that reproduces plausible learning outcomes and social dynamics without the institutional constraints or ethical compromises of live trials.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance

Researchers propose a modular architecture for educational AI chatbots designed to enforce pedagogical principles and prevent negative learning outcomes. The approach addresses structural limitations in current monolithic LLM solutions by incorporating targeted modules at different exercise-solving stages, enabling more transparent and controlled student guidance.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading

Researchers propose REC-CBM, a novel machine learning model that combines concept bottleneck models with rubric-aware error correction to automate open-ended educational grading while maintaining transparency and interpretability. Unlike black-box LLM systems, REC-CBM allows educators to verify scoring decisions through human-interpretable concept reasoning, addressing the growing need for trustworthy automated grading in educational settings.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

KT4EQG: Personalized Exercise Question Generation via Knowledge Tracing

KT4EQG is a new educational framework that combines knowledge tracing with AI-powered question generation to create personalized exercise questions for students. The system uses machine learning to model each student's knowledge state and generates customized questions designed to maximize learning outcomes, demonstrating superior effectiveness compared to non-personalized approaches.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

From Learning Resources to Competencies: LLM-Based Tagging with Evidence and Graph Constraints

Researchers developed an LLM-based pipeline that automatically tags learning resources with competencies from structured frameworks, combining language models with graph constraints and evidence extraction. The system achieved strong performance metrics (0.57 micro-F1, 0.82 MRR) while providing transparent, auditable evidence spans—outperforming traditional baselines and addressing the labor-intensive challenge of manual resource tagging in educational systems.
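The two reported metrics measure different things: micro-F1 scores the set of tags assigned, while MRR scores how high the first correct tag appears in a ranked list. The sketch below shows one common way to compute each for multi-label tagging. The competency IDs and predictions are toy values invented for illustration, and the paper's exact evaluation protocol may differ.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1: pool true/false positives over all resources."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_reciprocal_rank(gold: List[Set[str]], ranked: List[List[str]]) -> float:
    """Average of 1/rank of the first correct tag (0 if none is correct)."""
    if not gold:
        return 0.0
    total = 0.0
    for g, candidates in zip(gold, ranked):
        for rank, tag in enumerate(candidates, start=1):
            if tag in g:
                total += 1.0 / rank
                break
    return total / len(gold)


# Toy data: two learning resources with hypothetical competency IDs.
gold = [{"C1", "C2"}, {"C3"}]
ranked = [["C1", "C4", "C2"], ["C5", "C3"]]
pred = [set(r[:1]) for r in ranked]  # keep only the top-ranked tag

f1 = micro_f1(gold, pred)                 # 2*1 / (2*1 + 1 + 2) = 0.4
mrr = mean_reciprocal_rank(gold, ranked)  # (1/1 + 1/2) / 2 = 0.75
print(f1, mrr)
```

A pipeline can rank the right competency near the top (high MRR) yet still miss or over-assign tags overall (moderate micro-F1), which is one reading of the gap between the two reported figures.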

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Memory-Based vs. Context-Only Conditioning Produces Distinct Behavioral Patterns in Stateful Personalization

Researchers compared two conditioning approaches in educational recommendation systems: context-based (using current student questions) versus memory-based (using persistent learner history). Memory-based conditioning produced more personalized, history-dependent behavior while context-based approaches showed stronger immediate responsiveness, suggesting that embedding-based similarity metrics alone are insufficient for capturing true personalization effects.

AI · Neutral · arXiv – CS AI · May 12 · 5/10

MBP-KT: Learning Global Collaborative Information from Meta-Behavioral Pattern for Enhanced Knowledge Tracing

Researchers propose MBP-KT, a machine learning framework that improves knowledge tracing by extracting collaborative learning patterns from student interaction sequences. The method transforms raw data into meta-behavioral patterns and injects this global collaborative information into various knowledge tracing models, demonstrating consistent performance improvements across real-world datasets.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Explainable Knowledge Tracing via Probabilistic Embeddings and Pattern-based Reasoning

Researchers introduce Probabilistic Logical Knowledge Tracing (PLKT), an interpretable AI framework that uses Beta-distributed probabilistic embeddings to model student knowledge states and predict learning performance. Unlike conventional deep learning approaches that rely on opaque deterministic embeddings, PLKT constructs transparent reasoning paths showing how past interactions influence predictions while maintaining superior accuracy compared to existing methods.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Evaluating Prompt Injection Defenses for Educational LLM Tutors: Security-Usability-Latency Trade-offs

Researchers evaluated prompt-injection defenses for educational LLM tutors, revealing inherent trade-offs between security, usability, and speed. A multi-layer safeguard pipeline had a 46.34% attack bypass rate with zero false positives and 2.50 ms latency, while competing systems like NeMo Guardrails eliminated bypasses but incurred 16.22% false positive rates and 1.3-second delays.
