Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO
Researchers demonstrate that Group Relative Policy Optimization (GRPO) combined with a novel Variance-Aware Reward Framework significantly improves smaller LLMs' performance on medical question answering, particularly for heart-related queries. On a held-out test set, accuracy rises from 36.2% to 50.2%, a relative improvement of about 38%, while remaining competitive with much larger models. The result offers a practical path toward efficient, deployable medical AI systems.
This research addresses a critical gap in healthcare AI deployment. While large language models demonstrate impressive capabilities, their scale creates serious obstacles in real-world medical settings: high inference costs, data privacy risks when patient data must leave the premises, and poor fit with edge computing environments. The study targets these constraints by optimizing smaller models through post-training rather than pursuing ever-larger architectures.
The innovation centers on how reward signals guide model training. Traditional approaches collapse multi-dimensional medical evaluation rubrics into single scores, losing information crucial for nuanced learning. The Variance-Aware Reward Framework preserves this granularity by deriving continuous analytical rewards from individual criteria rather than aggregating them prematurely. This richer feedback signal enables more effective reinforcement learning on sparse, complex medical reasoning tasks where ground truth is difficult to establish algorithmically.
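To make the information-loss argument concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical three-criterion rubric, a simple weighted mean as the continuous reward, and the standard GRPO practice of standardizing rewards within a group of sampled responses. The function names, the pass/fail threshold, and the example scores are all illustrative assumptions.

```python
from statistics import mean, pstdev


def collapsed_reward(criteria_scores, threshold=0.5):
    """Lossy baseline (assumed): collapse the rubric to one pass/fail score."""
    return 1.0 if mean(criteria_scores) >= threshold else 0.0


def continuous_reward(criteria_scores, weights=None):
    """Assumed alternative: weighted mean of per-criterion scores in [0, 1]."""
    if weights is None:
        weights = [1.0] * len(criteria_scores)
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, criteria_scores)) / total


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical responses to one question, each scored on three criteria.
group = [
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [0.0, 1.0, 1.0],
]

collapsed = [collapsed_reward(s) for s in group]
continuous = [continuous_reward(s) for s in group]

adv_collapsed = group_relative_advantages(collapsed)
adv_continuous = group_relative_advantages(continuous)

print("collapsed advantages: ", adv_collapsed)
print("continuous advantages:", adv_continuous)
```

In this toy group every response passes the collapsed threshold, so the rewards have zero variance and every advantage is zero: the policy receives no learning signal. The continuous rewards still separate the response that satisfies all three criteria from the others, which is the kind of signal a variance-aware design aims to preserve.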
The empirical results demonstrate substantial practical value. A 14B parameter model improved from 36.2% to 50.2% accuracy on heart-focused questions, a gain of 14 percentage points that nearly matches GPT-OSS-120B's 50.8% accuracy while using a fraction of the computational resources. This efficiency gain has direct implications for healthcare deployment scenarios where latency, cost, and privacy constraints dominate decision-making.
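The reported figures mix absolute and relative terms, so a short calculation helps keep them apart. The snippet below uses only the three accuracy values quoted in this summary; the variable names are illustrative.

```python
baseline = 36.2    # 14B model before training, accuracy in %
tuned = 50.2       # 14B model after training, accuracy in %
reference = 50.8   # GPT-OSS-120B, accuracy in %

absolute_gain = tuned - baseline                 # in percentage points
relative_gain = absolute_gain / baseline * 100   # in percent of the baseline
gap_to_reference = reference - tuned             # in percentage points

print(f"absolute gain:    {absolute_gain:.1f} points")
print(f"relative gain:    {relative_gain:.1f}%")
print(f"gap to reference: {gap_to_reference:.1f} points")
```

The absolute gain is 14.0 percentage points and the relative gain is about 38.7%, which is consistent with the 38% headline figure being a relative improvement. The remaining gap to the larger model is 0.6 points.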
The framework's extensibility to other rubric-based tasks suggests broader applicability across educational assessment, clinical decision support, and regulatory compliance domains. Natural next steps include testing the approach on non-medical specialized reasoning tasks and investigating how variance-aware rewards affect reinforcement learning algorithms other than GRPO.
- The Variance-Aware Reward Framework improves medical LLM accuracy by about 38% in relative terms (36.2% to 50.2%) by preserving multi-criteria rubric information during training.
- A 14B parameter model achieves near-parity with much larger models on cardiology question answering, enabling practical healthcare deployment.
- The approach reduces computational inference costs and data privacy risks compared to deploying general-purpose large language models.
- Rubric-based reward design provides a practical methodology for improving performance on tasks with sparse, difficult-to-verify feedback.
- The framework potentially extends to other specialized domains requiring nuanced multi-criteria evaluation.