Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation
Research reveals that few-shot large language models struggle significantly with mid-range quality responses in automated short answer scoring, while fine-tuned models and human experts maintain consistent performance across all quality levels. This degradation raises fairness concerns for students with developing understanding and underscores the need for quality-conditioned evaluation metrics.
The shift toward large language models in educational assessment represents a fundamental trade-off between deployment ease and scoring reliability. While LLMs offer broad knowledge and minimal setup friction, this study demonstrates their susceptibility to systematic scoring failures on nuanced, partially correct responses, precisely where human judgment matters most. The study tests multiple models against hundreds of real student responses, grounding the findings in realistic classroom data.
This mid-range degradation stems from insufficient task-specific adaptation. Few-shot LLMs, conditioned on only a handful of in-context examples rather than trained on labeled data, cannot calibrate to domain-specific evaluation rubrics, particularly when responses require judging subjective criteria in educational contexts. Conversely, fine-tuned BERT-based models trained on substantial labeled data maintain accuracy across all response qualities, suggesting that architecture and scale alone cannot compensate for the lack of task-specific training data.
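To make the contrast concrete, the sketch below shows the two adaptation strategies side by side: a few-shot prompt that conveys the rubric through only a handful of in-context examples, and a fine-tuned classifier trained on labeled responses. The 0-2 rubric, the example answers, and the model path are hypothetical illustrations, not details from the study.

```python
# A minimal sketch contrasting the two adaptation strategies, assuming a
# hypothetical 0-2 partial-credit rubric. Prompts, examples, and the model
# path are placeholders, not artifacts from the study.

# Strategy 1: few-shot prompting (no weight updates). The rubric reaches the
# model only through k in-context examples, so partial-credit boundaries
# must be inferred from a handful of demonstrations.
FEW_SHOT_EXAMPLES = [
    ("Plants make food from sunlight, water, and CO2.", 2),  # full credit
    ("Plants use sunlight to grow.", 1),                     # partial credit
    ("Plants eat soil.", 0),                                 # no credit
]

def build_few_shot_prompt(question: str, response: str) -> str:
    """Assemble a scoring prompt from a fixed set of worked examples."""
    parts = [f"Score each answer to '{question}' from 0 to 2.\n"]
    for answer, score in FEW_SHOT_EXAMPLES:
        parts.append(f"Answer: {answer}\nScore: {score}\n")
    parts.append(f"Answer: {response}\nScore:")
    return "\n".join(parts)

print(build_few_shot_prompt("How do plants make food?",
                            "Plants use light somehow."))

# Strategy 2: task-specific fine-tuning (weight updates on labeled data).
# A BERT-style classifier fine-tuned on hundreds of rubric-scored responses
# sees the full distribution of partial-credit answers during training.
# Assuming the Hugging Face `transformers` library and a locally fine-tuned
# checkpoint at a hypothetical path:
#   from transformers import pipeline
#   scorer = pipeline("text-classification", model="./sas-bert-finetuned")
#   scorer("Plants use light somehow.")  # e.g. [{'label': '1', 'score': ...}]
```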
The implications extend beyond academic assessment to any automated scoring system, such as customer feedback analysis, quality control, or performance evaluation. Organizations deploying LLMs for consequential decisions face hidden fairness risks: systems that appear reliable on clear-cut cases may systematically undervalue borderline cases, perpetuating inequitable outcomes. The research establishes quality-conditioned agreement as a critical evaluation metric previously overlooked in AI benchmarking.
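As an illustration of what quality-conditioned agreement looks like in practice, the sketch below stratifies scorer-versus-gold agreement by gold quality level. The three-level rubric and the toy data are invented for demonstration, not drawn from the study's dataset.

```python
# A minimal sketch of quality-conditioned agreement: instead of one overall
# agreement number, report agreement separately for each gold quality level.
# The rubric (0 = incorrect, 1 = partial, 2 = correct) and the toy data are
# illustrative assumptions.
from collections import defaultdict

def quality_conditioned_agreement(gold, predicted):
    """Return per-quality-level exact-agreement rate and mean absolute error."""
    buckets = defaultdict(list)
    for g, p in zip(gold, predicted):
        buckets[g].append(p)
    report = {}
    for level, preds in sorted(buckets.items()):
        exact = sum(p == level for p in preds) / len(preds)
        mae = sum(abs(p - level) for p in preds) / len(preds)
        report[level] = {"n": len(preds), "exact": exact, "mae": mae}
    return report

# Toy example: the scorer is reliable at the extremes (0 and 2) but noisy
# on partial-credit responses (1) -- the mid-range degradation pattern.
gold      = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
predicted = [0, 0, 0, 0, 2, 1, 2, 2, 2, 2]
for level, stats in quality_conditioned_agreement(gold, predicted).items():
    print(f"quality={level}: n={stats['n']}, "
          f"exact={stats['exact']:.2f}, mae={stats['mae']:.2f}")
```

Reporting the per-level sample size alongside agreement guards against reading too much into strata with only a few responses.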
Future development must balance efficiency with accuracy. Hybrid approaches that combine few-shot LLM convenience with task-specific fine-tuning on representative datasets may offer a practical middle ground. Educational institutions and assessment platforms should conduct similar quality-stratified analyses before deployment, particularly when assessing students from diverse backgrounds.
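One way such a hybrid might be operationalized is a confidence-gated router that auto-accepts high-confidence scores and escalates borderline (typically mid-range) responses to human graders. The sketch below is a hypothetical illustration with a stub scorer and an arbitrary threshold, not a design proposed by the study.

```python
# A minimal sketch of a hybrid deployment pattern: accept automated scores
# only when the model is confident, and route borderline responses to human
# review. The threshold and scoring function are hypothetical placeholders.
from typing import Callable, Tuple

def route_response(
    response: str,
    score_fn: Callable[[str], Tuple[int, float]],  # returns (score, confidence)
    confidence_threshold: float = 0.85,
):
    """Auto-accept confident scores; flag uncertain ones for a human grader."""
    score, confidence = score_fn(response)
    if confidence >= confidence_threshold:
        return {"score": score, "source": "model"}
    return {"score": None, "source": "human_review", "model_guess": score}

# Stub scorer standing in for any model; partial answers get low confidence.
def toy_scorer(response: str) -> Tuple[int, float]:
    if "sunlight" in response and "water" in response:
        return 2, 0.95
    if "sunlight" in response:
        return 1, 0.60   # mid-range: below threshold, so escalated
    return 0, 0.92

print(route_response("Plants use sunlight to grow.", toy_scorer))
# -> {'score': None, 'source': 'human_review', 'model_guess': 1}
```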
- Few-shot LLMs show significant performance degradation on mid-range quality responses compared to fully correct or incorrect answers.
- Task-specific adaptation directly correlates with scoring consistency, with fine-tuned models outperforming few-shot LLMs.
- Human experts maintain stable agreement across all response quality levels, establishing the benchmark for fair automated assessment.
- Mid-range degradation in AI scoring may inequitably disadvantage students with developing understanding or from underrepresented groups.
- Quality-conditioned fairness metrics are essential evaluation criteria for deploying AI in high-stakes assessment contexts.