Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation
Researchers discovered that chain-of-thought distillation—training smaller AI models to imitate larger models' reasoning—produces higher answer accuracy on medical benchmarks while simultaneously degrading reasoning quality. A Qwen3-8B student model improved from 74.7% to 84.4% accuracy on MedQA-USMLE, yet error rates in individual reasoning steps jumped from 30.6% to 50.3%, suggesting models learn to mimic expert-like output without grounding claims in sound logic.
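To put the reported figures side by side, the short sketch below computes the absolute change (in percentage points) and the after/before ratio for both metrics. The four numbers are the ones quoted above; the helper name `change` is purely illustrative and does not come from the study.

```python
def change(before: float, after: float) -> tuple[float, float]:
    """Return (change in percentage points, ratio after/before), both rounded."""
    return round(after - before, 1), round(after / before, 2)

# Figures quoted in the summary for the Qwen3-8B student on MedQA-USMLE.
accuracy_change = change(74.7, 84.4)     # answer accuracy, in percent
step_error_change = change(30.6, 50.3)   # step-level error rate, in percent

print("accuracy:", accuracy_change)
print("step error rate:", step_error_change)
```

On these numbers, accuracy rises by about 9.7 points while the step-level error rate rises by about 19.7 points, roughly 1.6 times its starting value.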
The study reveals a critical reliability gap in medical AI systems that are optimized for answer-level metrics at the expense of reasoning integrity. The researchers trained a smaller Qwen model to replicate a DeepSeek-V3 teacher's answers on medical questions, achieving impressive improvements in accuracy and calibration. However, blind audits by independent LLM judges and clinical experts exposed a troubling divergence: while final answers improved, the step-by-step reasoning traces deteriorated substantially, with error rates in non-abstained steps rising sharply.
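A step-level error rate over non-abstained steps can be sketched as follows. This is a minimal illustration, not the study's actual audit pipeline: the label vocabulary (`"correct"`, `"error"`, `"abstain"`) and the function name `step_error_rate` are assumptions made for the example, and the labels are toy data.

```python
from collections import Counter
from typing import Optional, Sequence

def step_error_rate(labels: Sequence[str]) -> Optional[float]:
    """Fraction of judged steps labeled "error".

    Steps labeled "abstain" (the judge declined to rule) are excluded
    from the denominator. Returns None if no step was judged.
    The label names are hypothetical, chosen for this sketch.
    """
    counts = Counter(labels)
    judged = counts["correct"] + counts["error"]
    if judged == 0:
        return None
    return counts["error"] / judged

# Toy audit of one reasoning trace: five steps, one abstention.
toy_labels = ["correct", "error", "abstain", "correct", "error"]
toy_rate = step_error_rate(toy_labels)
print(toy_rate)
```

Because the denominator counts only judged steps, a trace can end in a correct final answer and still score poorly on this metric, which is the kind of divergence the audit reports.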
This finding challenges the widespread assumption that answer-level metrics sufficiently validate model reasoning. In high-stakes domains like medicine, where clinical justification matters as much as diagnosis, students effectively learned to produce expert-appearing output while failing to reliably justify each claim. The effect persisted across multiple evaluators, teacher architectures, student sizes, and medical benchmarks—suggesting it reflects a systemic property of distillation rather than a specific implementation flaw. When compact medical answers under-constrain possible rationales, capable student models can reproduce correct outputs through pattern matching rather than genuine reasoning.
For practitioners deploying these systems, the implications are severe. If distilled models are released or reused in downstream applications, their reasoning traces cannot be trusted despite strong accuracy scores. Clinical systems, regulatory submissions, and educational uses of AI may rely on traces that look competent but contain substantial logical errors. The researchers find that the risk emerges specifically when the answer format leaves clinical justification under-specified and students can imitate surface characteristics without grounding local claims. These boundary conditions help predict when the failure mode appears.
- Chain-of-thought distillation improves answer accuracy while degrading reasoning quality, creating a dangerous disconnect in medical AI applications.
- Student models learn to mimic expert-like output through pattern matching rather than developing sound step-by-step reasoning.
- Standard accuracy and calibration metrics fail to detect the degradation in reasoning trace quality, masking systematic failures.
- The divergence occurs when compact answer formats leave clinical justification under-constrained, allowing students to bypass genuine grounding.
- Releasing or reusing distilled reasoning traces without careful evaluation of individual step correctness poses significant risks in high-stakes domains.