Designing Reliable LLM-Assisted Rubric Scoring for Constructed Responses: Evidence from Physics Exams

arXiv – CS AI | Xiuxiu Tang, G. Alex Ambrose, Ying Cheng
🤖 AI Summary

Researchers evaluated GPT-4o's ability to score physics exam responses using rubric-assisted scoring, finding that AI reliability matches human inter-rater consistency when rubrics are well-structured and granular. The study reveals that clear rubric design matters far more than LLM configuration choices, with performance declining on ambiguous mid-range responses.

Analysis

This study addresses a critical gap in educational AI implementation by systematically testing how LLMs can assist with constructed-response scoring in STEM fields. The research moves beyond proof-of-concept demonstrations to examine the practical conditions that determine reliability, comparing GPT-4o performance against human instructors across authentic physics exams with varying response quality and clarity.

The findings respond to growing pressure in education to scale assessment while maintaining consistency. Scoring traditional handwritten STEM responses consumes significant instructor time and is prone to rater bias, particularly when partial credit is involved. LLMs promise efficiency gains, but deploying them requires understanding which design choices actually drive reliability. The study systematically isolates three such variables (rubric granularity, prompt format, and temperature settings) and finds that structural clarity in the rubric dominates the other factors.
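
The paper's actual prompts are not reproduced here, but a minimal sketch of what checklist-style, rubric-assisted scoring with GPT-4o might look like is below. The rubric items, prompt wording, JSON output contract, and score_response helper are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of rubric-assisted scoring with the OpenAI Python SDK.
# The rubric items, prompt wording, and JSON output contract are illustrative
# assumptions; the paper's actual prompts and parsing are not public.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A fine-grained, checklist-style rubric: each criterion earns credit independently.
RUBRIC = [
    {"id": "R1", "criterion": "Identifies all forces acting on the block", "points": 2},
    {"id": "R2", "criterion": "Applies Newton's second law along the incline", "points": 2},
    {"id": "R3", "criterion": "Solves for acceleration with correct units", "points": 1},
]

def score_response(student_answer: str, temperature: float = 0.0) -> dict:
    """Ask the model to award credit per rubric item and return parsed JSON."""
    rubric_text = "\n".join(
        f'{r["id"]} ({r["points"]} pts): {r["criterion"]}' for r in RUBRIC
    )
    prompt = (
        "Score the student response against each rubric item.\n"
        f"Rubric:\n{rubric_text}\n\n"
        f"Student response:\n{student_answer}\n\n"
        'Return JSON: {"scores": {"R1": int, "R2": int, "R3": int}, "rationale": str}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=temperature,  # the study found this matters less than rubric clarity
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(resp.choices[0].message.content)
```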

The practical implications extend beyond physics education. The research shows that AI-assisted scoring matches human reliability on clear-cut responses but struggles with ambiguous middle-ground ones, a pattern that suggests limits in how current LLMs handle nuanced judgment calls requiring contextual reasoning. For educational institutions and assessment platforms, the findings indicate that investing in rubric design yields better returns than tweaking LLM configuration.
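
The summary does not name the paper's agreement statistic; quadratic weighted kappa is a common reliability measure for ordinal rubric scores, and the sketch below, using toy data, shows how model-versus-human agreement could be checked. The metric choice and the data are assumptions, not figures from the study.

```python
# Sketch of checking model-vs-human agreement on rubric score totals.
# Quadratic weighted kappa is a standard statistic for ordinal scores; whether
# the paper uses this exact metric is an assumption, and the data are toy values.
from sklearn.metrics import cohen_kappa_score

human_scores = [5, 4, 0, 3, 5, 2, 1, 4, 5, 3]   # instructor-assigned totals (toy data)
model_scores = [5, 4, 1, 2, 5, 2, 1, 4, 5, 2]   # LLM-assigned totals (toy data)

kappa = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"Quadratic weighted kappa: {kappa:.2f}")  # values near 1.0 indicate strong agreement
```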

The work establishes a replicable methodology for evaluating AI scoring across other STEM disciplines. Future implementations should prioritize checklist-based rubrics over holistic scoring frameworks and recognize that temperature adjustments provide minimal benefit compared to prompt clarity. The research provides a template for responsible AI adoption in high-stakes assessment contexts where transparency and consistency directly impact student outcomes.

Key Takeaways
  • Rubric design quality matters more than LLM configuration choices for reliable AI-assisted scoring
  • Fine-grained, checklist-based rubrics significantly outperform holistic scoring approaches
  • AI reliability matches human inter-rater agreement on high- and low-performing responses but declines on ambiguous mid-range cases (see the routing sketch after this list)
  • Temperature settings have minimal impact on scoring consistency compared to prompt formatting
  • Clear, skill-based rubrics enable transferable AI scoring implementation across STEM assessment contexts
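
Taken together, the takeaways suggest a simple deployment pattern: accept AI scores on clear-cut cases and route ambiguous mid-range totals to a human grader. The sketch below illustrates that routing logic; the thresholds, function name, and score format are illustrative assumptions, not something reported in the study.

```python
# Sketch of routing logic implied by the takeaways: accept AI scores on
# clear-cut (high or low) responses, flag ambiguous mid-range totals for
# human review. Thresholds and field names are illustrative assumptions.
def route_for_review(item_scores: dict[str, int], max_points: int = 5) -> str:
    """Return 'auto' for clear-cut totals, 'human_review' for mid-range ones."""
    total = sum(item_scores.values())
    fraction = total / max_points
    # Mid-range responses are where the study saw reliability decline.
    if 0.3 <= fraction <= 0.7:
        return "human_review"
    return "auto"

print(route_for_review({"R1": 2, "R2": 2, "R3": 1}))  # 5/5 -> 'auto'
print(route_for_review({"R1": 1, "R2": 1, "R3": 0}))  # 2/5 -> 'human_review'
```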