MedAction: Towards Active Multi-turn Clinical Diagnostic LLMs
Researchers introduce MedAction, a new framework and dataset designed to improve how large language models perform clinical diagnosis by simulating real-world multi-turn diagnostic processes. The approach tackles fundamental limitations in current medical LLMs with a tree-structured distillation pipeline that generates high-quality diagnostic trajectories; a model fine-tuned on the resulting data achieves state-of-the-art performance among open-source models.
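To make the tree-structured idea concrete, here is a minimal sketch of how such a pipeline could branch on candidate actions at each turn and keep only coherent paths as training trajectories. The function arguments (`propose_actions`, `apply_action`, `passes_filters`) are hypothetical placeholders standing in for the paper's components, not its actual implementation.

```python
# Hedged sketch of tree-structured trajectory generation: at each turn a
# teacher model proposes several candidate actions, the tree branches on
# each, and only branches that pass quality filters survive.
def expand_trajectories(state, propose_actions, apply_action, passes_filters,
                        depth=0, max_depth=4, branching=3):
    """Return all filtered action sequences reachable from `state`."""
    if depth == max_depth:
        return [[]]  # empty tail: this path ends here
    trajectories = []
    for action in propose_actions(state)[:branching]:
        next_state = apply_action(state, action)
        if not passes_filters(next_state):
            continue  # prune branches that break diagnostic coherence
        for tail in expand_trajectories(next_state, propose_actions,
                                        apply_action, passes_filters,
                                        depth + 1, max_depth, branching):
            trajectories.append([action] + tail)
    return trajectories
```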
The MedAction research addresses a critical gap between how medical LLMs are trained and how clinical diagnosis actually works in practice. Traditional evaluations present models with complete patient information in single interactions, whereas real diagnosis unfolds iteratively—clinicians form initial hypotheses, order targeted tests, interpret results, and refine their differential diagnoses across multiple turns. Current models fail at this process through three documented failure modes: ordering tests without clinical grounding, updating diagnoses unreliably, and losing coherence across multiple interactions.
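The iterative process described above can be sketched as a simple interaction loop. The `model` and `patient` objects below are hypothetical stand-ins (anything exposing these methods), not MedAction's actual interfaces; the point is only to show how evidence accumulates turn by turn.

```python
# Minimal sketch of a multi-turn diagnostic loop: hypothesize, order a
# test, read the result, refine the differential, repeat.
from dataclasses import dataclass, field


@dataclass
class DiagnosticState:
    """Evidence gathered so far, built up turn by turn."""
    findings: dict = field(default_factory=dict)      # test name -> observed result
    differential: list = field(default_factory=list)  # current ranked hypotheses


def run_diagnosis(model, patient, chief_complaint, max_turns=5):
    state = DiagnosticState(findings={"chief_complaint": chief_complaint})
    for _ in range(max_turns):
        action = model.decide(state)  # propose a test or commit to a diagnosis
        if action["kind"] == "diagnose":
            return action["diagnosis"]
        result = patient.answer(action["test"])   # simulated patient / record lookup
        state.findings[action["test"]] = result
        state.differential = model.update_differential(state)
    return state.differential[0] if state.differential else None
```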
This work represents a maturation of medical AI development, moving beyond static benchmarks toward dynamic, interactive problem-solving. The researchers developed a systematic approach using knowledge-graph-grounded metrics—Disease Trajectory Consistency and Reasoning-Action Consistency—to ensure generated training data maintains logical coherence. By synthesizing 32,681 multi-turn trajectories from 2,896 medical cases, they created MedAction-32K, a purpose-built dataset addressing the specific deficits in existing medical training corpora.
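The exact definitions of Disease Trajectory Consistency and Reasoning-Action Consistency are the paper's own; the sketch below only illustrates the general idea of a knowledge-graph-grounded consistency check, scoring how often an ordered test is linked in the graph to a disease in the current differential.

```python
# Simplified illustration of a knowledge-graph-grounded consistency check,
# not the paper's DTC/RAC formulas.
def reasoning_action_consistency(trajectory, kg_edges):
    """Fraction of turns whose ordered test is linked in the knowledge graph
    to at least one disease in the current differential.

    trajectory: list of (differential_diseases, ordered_test) per turn
    kg_edges:   set of (disease, test) pairs considered clinically grounded
    """
    if not trajectory:
        return 0.0
    consistent = 0
    for differential, test in trajectory:
        if any((disease, test) in kg_edges for disease in differential):
            consistent += 1
    return consistent / len(trajectory)


# Example: two of the three ordered tests are grounded in the graph.
kg = {("pneumonia", "chest_xray"), ("pneumonia", "sputum_culture"),
      ("pulmonary_embolism", "ct_angiogram")}
traj = [(["pneumonia"], "chest_xray"),
        (["pneumonia", "pulmonary_embolism"], "ct_angiogram"),
        (["pneumonia"], "brain_mri")]
print(reasoning_action_consistency(traj, kg))  # ~0.67
```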
For the medical AI industry, this approach signals an important methodological shift. Rather than relying solely on existing medical literature or generic instruction-tuning, effective clinical LLMs require trajectories that capture the temporal, evidential logic of diagnosis. The fine-tuned 8B model achieving state-of-the-art open-source performance demonstrates that architectural sophistication matters less than training on appropriately structured diagnostic reasoning. This pattern likely extends beyond diagnosis to other clinical decision-making domains.
The research establishes benchmarks for evaluating multi-turn clinical reasoning, which will influence how future medical LLMs are developed and validated. The public MedAction-300-Hard benchmark provides a rigorous standard for assessing whether models understand diagnostic coherence, not just medical facts.
- MedAction introduces a tree-structured pipeline that generates multi-turn diagnostic trajectories addressing core failures in current medical LLMs.
- Knowledge-graph-grounded metrics (DTC and RAC) ensure training data maintains logical consistency between clinical evidence and diagnostic updates.
- The MedAction-32K dataset of 32,681 trajectories demonstrates that appropriately structured training data significantly improves open-source medical LLM performance.
- Current medical LLMs fail because training data emphasizes reasoning from complete information rather than decision-making under evolving, partial evidence.
- The approach represents a methodological shift from static evaluation toward dynamic, interactive clinical reasoning benchmarking.