Improving Heart-Focused Medical Question Answering in LLMs via Variance-Aware Rubric Rewards with GRPO
Researchers demonstrate that Group Relative Policy Optimization (GRPO) combined with a novel Variance-Aware Reward Framework significantly improves smaller LLMs' performance on medical question answering, particularly for heart-related queries. The approach achieves 38% accuracy improvement on a held-out test set while remaining competitive with much larger models, offering a practical path toward efficient, deployable medical AI systems.