AI · Bullish · arXiv – CS AI · 5h ago · 7/10
LLM Performance on a Real, Double-Marked GCSE Benchmark
Researchers compared large language models with human examiners on 32,534 real UK GCSE exam responses and found that the top-performing models agree with the examiner consensus more closely than examiners agree with each other. The results indicate that LLMs can reliably grade subjective tasks such as essays and can handle complex handwritten work, which suggests that automated marking is viable.