CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
Researchers introduce CLR-voyance, a framework that treats inpatient clinical reasoning as a partially observable decision process with outcome-grounded rewards validated by clinicians. The resulting CLR-voyance-8B model outperforms GPT-5 and larger medical models on clinical benchmarks while maintaining generalist capabilities, and has been deployed in a hospital for six months.
CLR-voyance addresses a fundamental limitation in clinical AI evaluation: existing benchmarks often collapse complex sequential decision-making into static retrieval tasks or subjective scoring. By reformulating inpatient reasoning as a partially observable Markov decision process (POMDP), the framework acknowledges that clinicians act under genuine uncertainty, with incomplete information about patient futures. This conceptual shift is significant because it aligns AI evaluation with real clinical practice rather than idealized scenarios.
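The POMDP framing can be made concrete with a minimal sketch: the full patient trajectory, including outcomes not yet known at decision time, is the hidden state, while the agent sees only events up to a cutoff. All class and field names below are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class ClinicalEvent:
    """One timestamped entry in a patient record (lab, note, order, outcome)."""
    time: int
    kind: str
    payload: str

@dataclass
class InpatientPOMDP:
    """Minimal POMDP framing of an inpatient stay.

    The full trajectory (including future outcomes) is the hidden state;
    the clinician-visible history up to the cutoff is the observation.
    """
    trajectory: list  # all ClinicalEvents, ordered by time
    cutoff: int       # decision time t

    def observation(self):
        # Clinician-visible history: everything at or before the cutoff.
        return [e for e in self.trajectory if e.time <= self.cutoff]

    def oracle_future(self):
        # Oracle-only future: used to ground rewards, never shown to the agent.
        return [e for e in self.trajectory if e.time > self.cutoff]

# Example: a three-event stay with a decision point at t=1.
traj = [ClinicalEvent(0, "lab", "WBC 14.2"),
        ClinicalEvent(1, "note", "fever, suspected sepsis"),
        ClinicalEvent(2, "outcome", "blood culture positive")]
env = InpatientPOMDP(traj, cutoff=1)
```

The key design point is that `oracle_future` is reserved for the evaluator: the policy under evaluation only ever receives `observation()`, mirroring the uncertainty a clinician faces at the bedside.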
The technical approach is sophisticated. The framework partitions patient journeys into clinician-visible history and oracle-only futures, then uses this split to generate verifiable, case-specific rubrics that anchor evaluation in actual outcomes rather than expert opinion alone. This outcome-grounding tackles a perennial problem in medical AI: the tendency for LLM judges to score reasoning based on plausibility rather than clinical correctness. The post-training pipeline, which applies GRPO (Group Relative Policy Optimization) and model merging to Qwen3-8B and MedGemma-4B, demonstrates practical engineering at scale.
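The rubric-grounding idea can be sketched as a reward function: each confirmed future outcome becomes a checkable criterion, and a response is scored by the weighted fraction of criteria it satisfies. The keyword-matching check and function names here are crude stand-ins for illustration only; the paper's rubrics are clinician-validated, not regex-based.

```python
import re

def build_rubric(future_events):
    """Derive case-specific rubric items from oracle-only outcomes.

    Illustrative stand-in: each confirmed future finding becomes a
    checkable criterion (did the model's plan anticipate it?).
    """
    rubric = []
    for event in future_events:
        rubric.append({
            "criterion": f"anticipates: {event}",
            # Crude keyword proxy for a verifiable check.
            "pattern": re.escape(event.split()[0].lower()),
            "weight": 1.0,
        })
    return rubric

def outcome_grounded_reward(response, rubric):
    """Score a response as the weighted fraction of rubric items met."""
    text = response.lower()
    total = sum(item["weight"] for item in rubric)
    hit = sum(item["weight"] for item in rubric
              if re.search(item["pattern"], text))
    return hit / total if total else 0.0

future = ["sepsis confirmed by blood culture", "AKI on day 3"]
rubric = build_rubric(future)
reward = outcome_grounded_reward(
    "Start empiric antibiotics for suspected sepsis; monitor renal function.",
    rubric)  # anticipates sepsis but not AKI, so reward is 0.5
```

Because the reward is computed against outcomes that actually occurred, a fluent but clinically wrong plan scores low, which is exactly the failure mode of plausibility-based LLM judging that the framework targets.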
The clinician alignment study represents perhaps the most valuable contribution. By having physicians curate rubrics, grade responses, and provide pairwise preferences, the work generates insights into how clinical professionals actually evaluate reasoning—data that can inform the broader medical AI community beyond this specific system. The six-month hospital deployment validates practical utility, suggesting the framework produces clinically acceptable outputs in real settings.
For AI development, this work establishes a methodological template for evaluation-driven improvement in high-stakes domains. The superior performance of the 8B model against GPT-5 suggests that domain-specific training with rigorous evaluation frameworks can outperform scale alone—a pattern relevant to specialized AI applications across healthcare and other regulated industries.
- CLR-voyance reformulates clinical reasoning as a POMDP with outcome-grounded, clinician-validated reward signals rather than closed-form evaluation
- CLR-voyance-8B outperforms GPT-5 (84.91% vs 77.83%) and MedGemma-27B on clinical reasoning while maintaining generalist capabilities
- The framework uses patient journey partitioning to generate verifiable, case-specific rubrics that anchor evaluation in actual patient outcomes
- Large-scale clinician alignment study provides insights on LLM-as-judge evaluation and preference model selection applicable across medical AI
- Six-month hospital deployment demonstrates practical viability and acceptability of the approach in real clinical settings