#automated-scoring News & Analysis

5 articles tagged with #automated-scoring. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bearish · arXiv – CS AI · May 11 · 7/10

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

Research shows that AI models, particularly few-shot large language models, struggle significantly with responses of mid-range quality in automated short answer scoring, while fine-tuned models and human experts maintain consistent performance across all quality levels. This degradation raises fairness concerns for students whose understanding is still developing and underscores the need for quality-conditioned evaluation metrics.

🧠 GPT-4 · 🧠 GPT-5 · 🧠 Claude

AI · Neutral · arXiv – CS AI · May 1 · 6/10

The Impact of LLM Self-Consistency and Reasoning Effort on Automated Scoring Accuracy and Cost

Researchers analyzing LLM-based automated scoring found that strategic model selection and reasoning configuration improve accuracy more than ensemble methods do. Temperature sampling improved performance, but larger ensembles showed diminishing returns. Higher reasoning effort correlated with better accuracy, though the cost-benefit ratio varied across model families.

🏢 OpenAI · 🧠 GPT-5 · 🧠 Gemini

AI · Neutral · arXiv – CS AI · Mar 2 · 6/10 · 19

BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation

Researchers developed BRIDGE, a framework to reduce bias in AI-powered automated scoring systems that unfairly penalize English Language Learners (ELLs). The framework addresses representation bias by generating synthetic high-scoring ELL samples, achieving fairness improvements comparable to those from additional human data while maintaining overall performance.

AI · Bullish · arXiv – CS AI · Mar 27 · 4/10

Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors

Researchers tested a dual-architecture LLM-based automated scoring system for educational assessments and found it generally robust to construct-irrelevant factors such as meaningless text padding and spelling errors. Off-topic responses, by contrast, were heavily penalized. The study suggests that properly designed LLM-based scoring systems can be reliable.

AI · Neutral · arXiv – CS AI · Mar 9 · 5/10

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

Research demonstrates that ChatGPT can code communication data with accuracy comparable to human raters while maintaining consistency across different demographic groups including gender and racial/ethnic categories. The study introduces three evaluation checks for assessing subgroup consistency in LLM-based coding systems for large-scale collaboration assessments.

🧠 ChatGPT