RewardHarness: Self-Evolving Agentic Post-Training
RewardHarness introduces a self-evolving agentic framework that substantially improves reward modeling for image-editing evaluation while using only 0.05% of typical training data. By iteratively refining tools and skills from a handful of examples rather than large-scale annotations, the system reaches 47.4% accuracy on image-editing benchmarks, outperforming GPT-5 and pointing toward more efficient AI alignment.
RewardHarness addresses a fundamental efficiency problem in AI training: the data-annotation bottleneck that separates human learning from machine learning. Traditional reward models require hundreds of thousands of preference comparisons to align with human judgment, yet humans infer evaluation criteria from mere examples. This work demonstrates that reward modeling can shift from weight optimization—the conventional supervised learning approach—to context evolution, where an orchestrator dynamically selects relevant tools and reasoning strategies without retraining underlying models.
The framework's architecture reflects recent advances in agentic AI systems. Rather than relying on a monolithic reward network, RewardHarness maintains an evolving library of tools and skills that a frozen sub-agent chains together to produce judgments. The orchestrator learns which combinations work by comparing predictions against ground truth, automatically refining its selections without additional human annotation. This mirrors a broader trend in AI where modular, compositional reasoning proves more flexible and efficient than end-to-end training.
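The paper does not spell out its implementation here, but the loop it describes can be sketched in a few lines. The toy below is a minimal illustration under assumed names (`TOOLS`, `frozen_judge`, `evolve_tool_selection`) and a made-up edit-record format, not RewardHarness's actual code: the orchestrator searches over tool combinations against a handful of labeled examples, keeps whichever combination the frozen judge scores most accurately, and never touches model weights.

```python
"""Toy sketch of the self-evolving loop; tools, data format, and selection rule
are illustrative assumptions, not the paper's implementation."""
from itertools import combinations

# Assumed "tool library": each tool inspects an edit record and votes good (1) / bad (0).
TOOLS = {
    "instruction_match": lambda edit: int(edit["instruction_followed"]),
    "identity_preserved": lambda edit: int(edit["subject_preserved"]),
    "artifact_check":     lambda edit: int(edit["artifact_score"] < 0.3),
}

def frozen_judge(edit, selected_tools):
    """Frozen sub-agent: chains the selected tools and returns a majority verdict."""
    votes = [TOOLS[name](edit) for name in selected_tools]
    return int(sum(votes) > len(votes) / 2)

def evolve_tool_selection(examples):
    """Orchestrator: search tool combinations against a handful of labeled examples.
    No weights are updated; only the context (which tools the judge uses) evolves."""
    best_tools, best_acc = None, -1.0
    names = list(TOOLS)
    for k in range(1, len(names) + 1):
        for combo in combinations(names, k):
            acc = sum(frozen_judge(e, combo) == e["label"] for e in examples) / len(examples)
            if acc > best_acc:
                best_tools, best_acc = combo, acc
    return best_tools, best_acc

# A few labeled examples stand in for the paper's tiny supervision budget.
examples = [
    {"instruction_followed": True,  "subject_preserved": True,  "artifact_score": 0.1, "label": 1},
    {"instruction_followed": True,  "subject_preserved": False, "artifact_score": 0.2, "label": 0},
    {"instruction_followed": False, "subject_preserved": True,  "artifact_score": 0.5, "label": 0},
]
tools, acc = evolve_tool_selection(examples)
print(f"selected tools: {tools}, few-shot accuracy: {acc:.2f}")
```

The real system is far richer (skills and reasoning strategies rather than three fixed checks, iterative refinement rather than exhaustive enumeration), but the essential property is the same: improvement comes from editing the judge's context, not its parameters.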
The results carry significant implications for AI development and deployment. Achieving competitive performance with 0.05% of the standard training data reduces computational overhead, annotation costs, and environmental impact, all critical considerations for scaling AI systems. The 3.52 ImgEdit-Bench score obtained when the reward model drives GRPO fine-tuning suggests practical utility beyond evaluation benchmarks. This efficiency gain matters most for developers building reward models in novel domains where large-scale preference data is unavailable or expensive to collect.
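For context, GRPO scores several candidate outputs per prompt with the reward model and normalizes each reward against its group's statistics. The snippet below shows only that standard group-relative advantage step; the reward values and function name are placeholders, not RewardHarness outputs.

```python
"""Minimal sketch of how per-edit reward scores could feed a GRPO-style update.
The group-relative normalization is standard GRPO; the scores are stand-ins."""
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each sampled edit's reward against its group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four candidate edits for one instruction, scored by a reward model.
rewards = [0.82, 0.41, 0.67, 0.15]
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])  # positive => reinforced, negative => penalized
```

Because advantages are computed within a group, the reward model only needs to rank candidate edits for the same instruction consistently, which is the kind of comparative judgment an agentic reward harness produces.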
The work leaves open questions: whether this approach generalizes beyond image editing to other multimodal tasks, and whether similar context-evolution strategies could stand in for the massive-scale supervision currently used to train other model components.
- RewardHarness achieves competitive reward modeling using only 0.05% of typical preference annotation data through iterative tool and skill refinement.
- The framework reframes reward learning as context evolution via dynamic tool selection rather than weight optimization, enabling frozen models to improve.
- Performance surpasses GPT-5 by 5.3 points on image-editing benchmarks and yields 3.52 on ImgEdit-Bench when used for RLHF fine-tuning.
- This approach significantly reduces computational and annotation costs while maintaining or exceeding accuracy, addressing scalability challenges in AI alignment.
- The modular, agentic architecture suggests broader applicability beyond image editing to other domains with limited preference data.