InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training
Researchers introduce ORBIT, a reinforcement learning framework that uses dynamically generated rubrics to fine-tune large language models for open-ended medical dialogue tasks. The approach achieves state-of-the-art performance on medical benchmarks with minimal training data, addressing the challenge of applying RL to complex tasks where traditional scalar reward signals are inadequate.
ORBIT advances the application of reinforcement learning to domains where feedback is inherently subjective and multidimensional. Traditional RL approaches struggle with open-ended medical dialogue because a response must be judged on several dimensions at once (accuracy, empathy, safety, and contextual appropriateness), which a single reward signal cannot capture. Case-conditioned rubrics that adapt dynamically offer a principled alternative to hand-crafted rules or task-specific reward model training, reducing the supervision burden while maintaining quality control.
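To make the idea of a rubric-derived reward concrete, here is a minimal sketch. It is not ORBIT's implementation: the `Criterion` class, the `rubric_reward` function, and the keyword-matching `keyword_judge` are all hypothetical names introduced for illustration, and the "points earned over points achievable, clipped to [0, 1]" aggregation is an assumed scoring rule. In a real pipeline the judge would presumably be an LLM deciding whether each criterion is met.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    """One rubric line (hypothetical structure): what to look for and its point value."""
    description: str
    points: float  # negative points penalize undesirable behavior


def rubric_reward(
    response: str,
    rubric: List[Criterion],
    judge: Callable[[str, Criterion], bool],
) -> float:
    """Return the share of achievable points earned, clipped to [0, 1] (assumed rule)."""
    max_points = sum(c.points for c in rubric if c.points > 0)
    if max_points == 0:
        return 0.0
    earned = sum(c.points for c in rubric if judge(response, c))
    return min(1.0, max(0.0, earned / max_points))


def keyword_judge(response: str, criterion: Criterion) -> bool:
    """Toy stand-in for an LLM judge: treats the criterion description as a keyword."""
    return criterion.description.lower() in response.lower()


# A case-specific rubric: two desirable behaviors and one penalized behavior.
rubric = [
    Criterion("emergency", 5.0),
    Criterion("sorry", 2.0),
    Criterion("guaranteed cure", -4.0),
]
reply = "I'm sorry you are in pain. Chest pain can be an emergency, so please call for help."
reward = rubric_reward(reply, rubric, keyword_judge)
```

Because the rubric is passed in per case, the same function can score different dialogues against different criteria, which is the property the case-conditioned design relies on.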
The framework achieves a 3.9x improvement in HealthBench-Hard scores using only 2,000 training samples, which demonstrates practical efficiency in a domain where labeled data is expensive and difficult to obtain. Because it relies on general-purpose instruction-following models rather than specialized medical domain experts for fine-tuning, ORBIT lowers the barrier to deployment in specialized fields. The approach has implications beyond medicine: any domain with complex, context-dependent evaluation criteria could benefit from rubric-guided incremental training.
For the AI industry, this work suggests a viable path forward for RL applications in subjective domains. Rather than attempting to compress multidimensional feedback into scalar rewards or relying on heavily supervised external systems, using structured rubrics as intermediate representations preserves evaluation complexity while remaining computationally tractable. The method's success with smaller models like Qwen3-4B indicates that scale alone is not the limiting factor for performance in specialized applications, opening opportunities for resource-constrained deployment scenarios in healthcare and enterprise settings.
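A small illustration of why collapsing feedback into one number loses information, under stated assumptions: the dimension names come from the discussion above, while the scores, the plain average, and the helper names `scalar_reward` and `weakest_dimension` are hypothetical and not taken from ORBIT.

```python
from typing import Dict

Scores = Dict[str, float]


def scalar_reward(scores: Scores) -> float:
    """Collapse per-dimension scores into one number with a plain average."""
    return sum(scores.values()) / len(scores)


def weakest_dimension(scores: Scores) -> str:
    """Return the dimension with the lowest score (first one wins on ties)."""
    return min(scores, key=scores.get)


# Two hypothetical responses with very different failure modes.
response_a = {"accuracy": 1.0, "empathy": 0.5, "safety": 0.0}
response_b = {"accuracy": 0.5, "empathy": 0.5, "safety": 0.5}

same_scalar = scalar_reward(response_a) == scalar_reward(response_b)
worst_for_a = weakest_dimension(response_a)
```

Both responses average to the same scalar, yet only the per-dimension view reveals that the first one fails entirely on safety. A rubric keeps that distinction available to the training signal.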
- ORBIT uses dynamically generated rubrics as adaptive guides for incremental RL, eliminating reliance on hand-crafted rules or task-specific reward model training.
- The framework achieves a 3.9x improvement in HealthBench-Hard scores with only 2,000 training samples, demonstrating data efficiency.
- The approach works with general-purpose LLMs without requiring domain-specific fine-tuning, reducing implementation barriers for specialized applications.
- Rubric-guided evaluation preserves the multidimensional complexity of feedback while remaining computationally tractable in subjective task domains.
- Success with smaller models such as Qwen3-4B suggests specialized performance is achievable without massive scale, enabling resource-constrained deployments in healthcare.