
Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses

arXiv – CS AI|Qilin Zhou, Zhuo Wang, Yue Li, W. K. Chan|June 9, 2026 at 04:00 AM

AI Summary

Researchers evaluated how large language models (GPT and Grok) perform at grading graduate-level research reports, finding significant inconsistencies both within individual models and between different models. The study reveals that interaction history causes models to systematically drift from human grading standards, raising concerns about fairness in automated academic assessment.

Analysis

Academic institutions face mounting pressure to streamline assessment processes, particularly in graduate programs where reading and grading research reports demands substantial faculty time. This research addresses a critical gap in understanding LLM reliability for specialized educational tasks beyond standard benchmarks. The study's findings are sobering for institutions considering large-scale LLM adoption: while models demonstrate potential to reduce educator workload, their grading behavior diverges meaningfully from human expert judgment in ways that threaten assessment fairness.

The research distinguishes between two failure modes: intra-model inconsistency (a single model assigning different grades to the same or similar work) and inter-model inconsistency (disagreement between different LLMs). More concerning is the drift the study attributes to interaction history: as a conversation accumulates, model grading moves systematically away from established human standards. This suggests that using LLMs in iterative workflows, which are common in educational settings, may compound fairness problems rather than solve them.
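To make the two consistency notions concrete, here is a minimal Python sketch of how they could be measured. It is not the paper's methodology: the function names, the choice of standard deviation and mean absolute difference as metrics, and the grades themselves are all illustrative assumptions.

```python
from itertools import combinations
from statistics import mean, pstdev


def intra_model_spread(repeated_grades):
    """Spread (population std. dev.) of one model's grades for the same report."""
    return pstdev(repeated_grades)


def inter_model_disagreement(grades_by_model):
    """Mean absolute difference between each pair of models' average grades."""
    averages = [mean(grades) for grades in grades_by_model.values()]
    gaps = [abs(a - b) for a, b in combinations(averages, 2)]
    return mean(gaps)


# Made-up grades (0-100) for a single report, graded three times per model.
# The model names are placeholders, not the systems evaluated in the study.
grades = {
    "model_a": [78, 82, 80],
    "model_b": [70, 70, 70],
}

for name, repeated in grades.items():
    print(name, "intra-model spread:", round(intra_model_spread(repeated), 2))
print("inter-model disagreement:", inter_model_disagreement(grades))
```

A model can look perfectly stable on the first metric (as `model_b` does here) and still disagree sharply with another model on the second, which is why the two are worth tracking separately.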

For educational technology developers and institutions, these findings indicate that naive LLM deployment creates systemic bias risks that simple ensemble methods did not resolve in this study. The implications may extend beyond grading: if LLMs drift under interaction history in specialized domains, similar vulnerabilities could plausibly affect other high-stakes applications requiring consistent decision-making. Educators exploring automation should keep a human in the loop and periodically recalibrate against expert standards rather than assuming models maintain consistent evaluation criteria over time.
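One way a periodic recalibration check might look in practice is sketched below. This is an assumption-laden illustration rather than a procedure from the paper: the window size, the tolerance, the function names, and the grades are all hypothetical.

```python
from statistics import mean


def signed_bias(model_grades, human_grades):
    """Mean of (model - human); positive means the model grades more generously."""
    return mean([m - h for m, h in zip(model_grades, human_grades)])


def needs_recalibration(model_grades, human_grades, window=3, tolerance=2.0):
    """Flag drift when bias over the latest `window` reports exceeds `tolerance`.

    `window` and `tolerance` are hypothetical defaults, not values from the study.
    """
    recent_bias = signed_bias(model_grades[-window:], human_grades[-window:])
    return abs(recent_bias) > tolerance


# Made-up grades in submission order: the model starts close to the human
# graders, then becomes steadily more generous as the session goes on.
model = [80, 81, 79, 85, 86, 88]
human = [80, 80, 80, 80, 80, 80]

print("early check:", needs_recalibration(model[:3], human[:3]))
print("later check:", needs_recalibration(model, human))
```

The check needs a sample of human-graded reports in every window, so it reduces expert grading effort without removing it.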

Key Takeaways

→ LLMs show variable intra-model consistency and significant disagreement between models when grading academic work.
→ Continuous interaction history causes models to systematically drift from human expert grading standards.
→ Simple ensemble approaches fail to improve alignment with human evaluation in this specialized task.
→ Indiscriminate LLM grading may introduce systemic unfairness despite reducing educator workload.
→ Specific operational practices and human oversight are essential to mitigate disparities in automated grading.

