Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation
Researchers discovered that chain-of-thought distillation—training smaller AI models to imitate larger models' reasoning—produces higher answer accuracy on medical benchmarks while simultaneously degrading reasoning quality. A Qwen3-8B student model improved from 74.7% to 84.4% accuracy on MedQA-USMLE, yet error rates in individual reasoning steps jumped from 30.6% to 50.3%, suggesting models learn to mimic expert-like output without grounding claims in sound logic.