#automated-grading News & Analysis

8 articles tagged with #automated-grading. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 25 · 7/10

LLM Performance on a Real, Double-Marked GCSE Benchmark

Researchers tested large language models against human examiners on 32,534 real UK GCSE exam responses, finding that top-performing models achieve higher agreement with examiner consensus than examiners do with each other. The results demonstrate LLMs can reliably grade subjective tasks like essays and handle complex handwritten work, suggesting viable automated marking solutions.
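The comparison in this summary (model versus examiner consensus, set against examiner versus examiner) can be illustrated with a minimal sketch. It assumes exact-match agreement as the metric, which the summary does not specify, and every mark below is made-up toy data rather than anything from the paper; the name `exact_agreement` is hypothetical.

```python
from typing import Sequence


def exact_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of responses on which two markers award the same mark."""
    if len(a) != len(b) or not a:
        raise ValueError("mark lists must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Toy marks for five double-marked responses (illustrative only).
examiner_1 = [3, 5, 2, 4, 1]
examiner_2 = [3, 4, 2, 5, 1]
consensus = [3, 4, 2, 4, 1]  # hypothetical agreed mark per response
model = [3, 4, 2, 4, 2]      # hypothetical LLM marks

human_human = exact_agreement(examiner_1, examiner_2)
model_consensus = exact_agreement(model, consensus)

print(f"examiner vs examiner: {human_human:.2f}")
print(f"model vs consensus:   {model_consensus:.2f}")
```

In this toy data the model matches the consensus on more responses than the two examiners match each other, which is the shape of the reported finding; a real evaluation would likely use a chance-corrected or ordinal-aware statistic.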

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

HG-Bench: A Benchmark for Multi-Page Handwritten Answer-Region Grounding in Automated Homework Assessment

Researchers introduce HG-Bench, a benchmark dataset of 500 annotated homework samples for evaluating how well automated grading systems locate and decompose handwritten student answers across multiple pages. Current AI models, including frontier VLMs, achieve less than 55% accuracy on complete answer localization, revealing a significant gap in their ability to understand the spatial structure of handwritten documents.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Towards Fully Automated Exam Grading: Fairness-Aware Recognition of Handwritten Answers with Foundation Models

Researchers demonstrate that vision-language foundation models can achieve 98.4% accuracy in automatically grading handwritten exam answers, compared to previous methods' 88-91%. The approach prioritizes fairness by minimizing false negatives that disadvantage students and shows promise for scalable, automated exam grading without sacrificing pedagogical quality.

🏢 Meta

AI · Bearish · arXiv – CS AI · Jun 9 · 6/10

Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses

Researchers evaluated how large language models (GPT and Grok) perform at grading graduate-level research reports, finding significant inconsistencies both within individual models and between different models. The study reveals that interaction history causes models to systematically drift from human grading standards, raising concerns about fairness in automated academic assessment.

🧠 Grok

AI · Neutral · arXiv – CS AI · May 28 · 6/10

REC-CBM: Rubric-Aware Error-Correction Concept Bottleneck Models for Trustworthy Open-Ended Grading

Researchers propose REC-CBM, a novel machine learning model that combines concept bottleneck models with rubric-aware error correction to automate open-ended educational grading while maintaining transparency and interpretability. Unlike black-box LLM systems, REC-CBM allows educators to verify scoring decisions through human-interpretable concept reasoning, addressing the growing need for trustworthy automated grading in educational settings.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

Human-in-the-Loop LLM Grading for Handwritten Mathematics Assessments

Researchers developed a human-in-the-loop LLM system for grading handwritten mathematics assessments that reduces grading time by 23% while maintaining accuracy comparable to manual grading. The system combines automated scanning, multi-pass LLM scoring, consistency checks, and mandatory human verification to handle pen-and-paper tests at scale.

AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 6

Confusion-Aware Rubric Optimization for LLM-based Automated Grading

Researchers introduce CARO (Confusion-Aware Rubric Optimization), a new framework that improves LLM-based automated grading by using confusion matrices to separate and fix specific error patterns instead of aggregating all errors together. This approach prevents conflicting constraints and significantly outperforms existing methods in teacher education and STEM datasets.
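As a rough illustration of separating error patterns with a confusion matrix rather than pooling all errors, the sketch below groups misgraded responses by their (human score, model score) cell. It is not CARO itself; the function names and the toy scores are hypothetical.

```python
from collections import Counter, defaultdict


def confusion_counts(human, predicted):
    """Count how often each (human score, model score) pair occurs."""
    return Counter(zip(human, predicted))


def group_errors_by_cell(human, predicted):
    """Map each off-diagonal confusion cell to the indices of its responses."""
    groups = defaultdict(list)
    for i, (h, p) in enumerate(zip(human, predicted)):
        if h != p:  # only disagreements count as errors
            groups[(h, p)].append(i)
    return dict(groups)


# Toy rubric scores on a 0-2 scale (illustrative only).
human = [2, 1, 0, 2, 1, 2]
predicted = [2, 2, 0, 1, 2, 2]

cells = confusion_counts(human, predicted)
errors = group_errors_by_cell(human, predicted)

for cell, indices in sorted(errors.items()):
    print(f"human={cell[0]} model={cell[1]}: responses {indices}")
```

Each group can then be inspected on its own, so a fix aimed at over-scoring (human 1, model 2) is not mixed with one aimed at under-scoring (human 2, model 1).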

AI · Neutral · arXiv – CS AI · Mar 3 · 4/10 · 6

Optimizing In-Context Demonstrations for LLM-based Automated Grading

Researchers introduce GUIDE, a new framework for improving automated grading of student responses using large language models. The system addresses key limitations in current LLM-based grading by optimizing the selection of in-context demonstrations and generating better explanations for scoring decisions.