Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors
Researchers tested a dual-architecture LLM-based automated scoring system for educational assessments and found it generally robust to construct-irrelevant factors such as meaningless text padding and spelling errors. The study suggests that properly designed LLM-based scoring systems can score reliably, though off-topic responses were heavily penalized.