AI · Bearish · arXiv – CS AI · 18h ago · 6/10
Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses
Researchers evaluated how well large language models (GPT and Grok) grade graduate-level research reports and found significant inconsistencies, both within individual models and between different models. The study finds that interaction history causes models to drift systematically from human grading standards, raising concerns about fairness in automated academic assessment.
🧠 Grok