y0news
🧠 AI | Neutral | Importance 6/10

From Scoring to Explanations: Evaluating SHAP and LLM Rationales for Rubric-based Teaching Quality Assessment

arXiv – CS AI | Ivo Bueno, Babette Bühler, Philipp Stark, Tim Fütterer, Ulrich Trautwein, Dorottya Demszky, Heather Hill, Enkelejda Kasneci
🤖AI Summary

Researchers propose a framework combining SHAP explainability with LLM-generated rationales to improve transparency in automated rubric-based scoring systems for educational assessment. Testing on classroom transcripts shows that fine-tuned pretrained language models outperform prompted LLMs in accuracy, and that SHAP attributions provide more faithful and transferable explanations than LLM rationales across different model architectures.

Analysis

This research addresses a critical gap in educational technology: the black-box problem of automated scoring systems. AI increasingly evaluates complex human performance such as teaching quality, yet stakeholders rarely understand why a particular score was assigned. The proposed framework pairs two explanation methods, Shapley value attributions (SHAP) from game theory and natural language rationales from large language models, so that their explanatory power can be compared directly.
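To make the idea of a Shapley value attribution concrete, the following minimal sketch computes exact Shapley values for the sentences of one transcript segment. It is an illustration only: `toy_score` is a hypothetical stand-in scorer (it counts sentences containing a question mark), not the model or rubric from the paper, and the example sentences are invented. Practical SHAP implementations approximate these values, because exact enumeration grows exponentially with the number of sentences.

```python
from itertools import combinations
from math import factorial


def toy_score(sentences):
    """Hypothetical scorer (an assumption): counts sentences with a question mark."""
    return float(sum("?" in s for s in sentences))


def shapley_values(sentences, score):
    """Exact Shapley value of each sentence with respect to `score`."""
    n = len(sentences)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):  # k = size of the coalition that excludes sentence i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without_i = [sentences[j] for j in subset]
                # Keep the original sentence order when adding sentence i back.
                with_i = [sentences[j] for j in sorted(subset + (i,))]
                values[i] += weight * (score(with_i) - score(without_i))
    return values


# Invented example segment, not taken from the NCTE corpus.
segment = [
    "Open your books.",
    "Why does this method work?",
    "Can anyone explain it another way?",
]
phi = shapley_values(segment, toy_score)
for sentence, value in zip(segment, phi):
    print(f"{value:+.2f}  {sentence}")
```

Because the toy scorer is additive, each question receives an attribution of about 1 and the instruction receives about 0, and the attributions sum to the score of the full segment minus the score of an empty one.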

The study builds on growing recognition that explainability matters in high-stakes educational settings where assessment decisions affect teachers and students. Prior work focused on either scoring accuracy or post-hoc explanations separately; this research systematically evaluates both simultaneously across 6,000 annotated transcript segments from the NCTE corpus.

The findings reveal a meaningful trade-off: fine-tuned pretrained language models achieve superior scoring accuracy but compress predictions toward middle-range scores, potentially masking genuine performance variation. In deletion tests, the sentences SHAP ranks as most important reliably drive predictions, whereas LLM-generated rationales show limited influence on predictions and vary inconsistently across models. SHAP attributions also transfer robustly to different architectures, suggesting they capture evidence relevant to the scoring task rather than artifacts of a single model.
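The summary does not describe the paper's exact deletion protocol, so the following is only a sketch of the general idea under stated assumptions: remove the k sentences an explanation ranks highest, measure how much the score drops, and compare that drop with the average drop from removing k random sentences. The scorer `toy_score`, the attribution values, and the example sentences are all hypothetical.

```python
import random


def toy_score(sentences):
    """Hypothetical scorer (an assumption): counts sentences with a question mark."""
    return float(sum("?" in s for s in sentences))


def deletion_drop(sentences, score, remove_idx):
    """Score change caused by deleting the sentences at the indices in `remove_idx`."""
    kept = [s for j, s in enumerate(sentences) if j not in remove_idx]
    return score(sentences) - score(kept)


def deletion_test(sentences, score, attributions, k, trials=200, seed=0):
    """Return (drop from deleting the top-k attributed sentences, mean random-k drop)."""
    n = len(sentences)
    ranked = sorted(range(n), key=lambda j: attributions[j], reverse=True)
    top_drop = deletion_drop(sentences, score, set(ranked[:k]))
    rng = random.Random(seed)
    random_drops = [
        deletion_drop(sentences, score, set(rng.sample(range(n), k)))
        for _ in range(trials)
    ]
    return top_drop, sum(random_drops) / trials


# Invented segment and attribution values, for illustration only.
segment = [
    "Open your books.",
    "Why does this method work?",
    "Take out a pencil.",
    "Can anyone explain it another way?",
]
attributions = [0.0, 1.0, 0.0, 1.0]
top_drop, random_mean = deletion_test(segment, toy_score, attributions, k=2)
print(f"top-k drop: {top_drop:.2f}, random-k mean drop: {random_mean:.2f}")
```

A faithful explanation should produce a top-k drop clearly larger than the random baseline. An explanation whose highlighted sentences can be deleted without changing the score is not tracking what the model actually uses.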

For educational technology developers and policymakers, the results favor SHAP-based explanations as a default in assessment systems. The transferability of the attributions across architectures indicates the framework could extend to other rubric-based evaluation domains beyond teaching, such as performance review systems, skills assessment, or content moderation. Organizations deploying automated scoring should prioritize SHAP over LLM rationales when transparency and fidelity matter, particularly in regulatory or accountability contexts where decision justifications face scrutiny.

Key Takeaways
  • SHAP attributions provide more faithful and transferable explanations than LLM-generated rationales for rubric-based scoring systems.
  • Fine-tuned language models outperform prompted LLMs in prediction accuracy but exhibit problematic label compression toward mid-range scores.
  • Deletion-based testing confirms SHAP identifies causally important sentences while LLM rationales show limited and inconsistent influence on predictions.
  • SHAP explanations transfer robustly across different model architectures, while LLM rationales remain model-specific and unreliable.
  • The framework provides a principled evaluation methodology for high-stakes educational assessment and other rubric-based language evaluation tasks.