Video Understanding Reward Modeling: A Robust Benchmark and Performant Reward Models
Researchers introduce Video Understanding Reward Bench (VURB), a comprehensive benchmark with 2,100 preference pairs for evaluating video reward models, alongside VUP-35K, a large-scale dataset of 35,000 preference examples. Two new models, VideoDRM and VideoGRM, achieve state-of-the-art performance on video understanding tasks, advancing multimodal AI capabilities beyond text and images.
The development of robust video understanding reward models addresses a critical gap in multimodal AI research. While reward models have matured significantly for text and images, reward modeling for video has lagged behind due to a shortage of benchmarks and preference data. This research tackles both problems at once through a unified framework that combines benchmark design, automated data construction, and model training.
Video understanding poses challenges that text and images do not: it requires temporal reasoning across frames and an understanding of complex event sequences. The introduction of VURB, with 2,100 preference pairs and detailed chain-of-thought reasoning traces (averaging 1,143 tokens), provides the first comprehensive evaluation standard for this domain. The VUP-35K dataset, created through a fully automated pipeline, scales training data to 35,000 examples without sacrificing quality, a significant engineering achievement that sidesteps manual annotation bottlenecks.
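To make the data format concrete, here is a minimal sketch of what a single VURB-style preference record might contain. The field names (`video_path`, `prompt`, `chosen`, `rejected`, `reasoning_trace`) and the example values are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One video preference example (illustrative schema, not the released format)."""
    video_path: str       # path or URL to the source video clip
    prompt: str           # question or instruction about the video
    chosen: str           # preferred model response
    rejected: str         # dispreferred model response
    reasoning_trace: str  # chain-of-thought justifying the preference
                          # (VURB traces average ~1,143 tokens)


# Hypothetical example record
pair = PreferencePair(
    video_path="clips/cooking_demo.mp4",
    prompt="What happens after the chef adds the onions?",
    chosen="The chef stirs the onions until translucent, then adds garlic.",
    rejected="The chef immediately plates the dish.",
    reasoning_trace=(
        "The stirring continues across many frames before the garlic "
        "appears, so the chosen answer matches the temporal order of events."
    ),
)
```

A record like this pairs naturally with both model types: a discriminative model scores `chosen` above `rejected`, while a generative model can be trained to reproduce a justification like `reasoning_trace`.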
The dual-model approach, with VideoDRM (discriminative) and VideoGRM (generative) reward models, offers flexibility for different use cases. Discriminative models excel at pairwise comparisons, while generative models can produce explanations alongside scores. That both achieve state-of-the-art results suggests the VUP-35K training data is of unusually high quality. The best-of-N scaling results indicate these models can substantially improve downstream video understanding systems through test-time computation, as the sketch below illustrates.
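Below is a minimal sketch of how a reward model could drive best-of-N selection at test time. The `score(video, prompt, response)` interface, the `vlm_sample` generator mentioned in the comments, and the toy heuristic are assumptions for illustration, not the paper's actual API.

```python
from typing import Callable, Sequence


def best_of_n(
    candidates: Sequence[str],
    video: str,
    prompt: str,
    score: Callable[[str, str, str], float],
) -> str:
    """Return the candidate response the reward model scores highest.

    `score` stands in for a discriminative reward model like VideoDRM:
    it maps (video, prompt, response) to a scalar reward. A generative
    reward model (VideoGRM-style) would emit an explanation alongside
    the score, but only the scalar is needed for selection.
    """
    return max(candidates, key=lambda c: score(video, prompt, c))


# Usage sketch with a placeholder scorer. In real use, `responses` would be
# N samples from a video-language model (a hypothetical `vlm_sample`), and
# `score` would call the trained reward model.
if __name__ == "__main__":
    toy_score = lambda video, prompt, resp: float(len(resp))  # placeholder heuristic
    responses = ["short answer", "a longer, more detailed answer"]
    print(best_of_n(responses, "clip.mp4", "Describe the clip.", toy_score))
```

Because selection needs only a scalar, either model type plugs into the same loop; the generative model's explanation is a bonus for debugging or auditing preferences.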
This work has immediate implications for AI developers building video understanding systems, from content recommendation to autonomous video analysis. It establishes foundational infrastructure that the research community can build upon, similar to how benchmark datasets have historically accelerated progress in language models and computer vision.
- VURB benchmark with 2,100 preference pairs establishes the first comprehensive evaluation standard for video reward models with reasoning traces
- VUP-35K dataset of 35,000 examples created via an automated pipeline enables large-scale training without manual annotation bottlenecks
- VideoDRM and VideoGRM achieve state-of-the-art performance, with discriminative and generative approaches offering complementary strengths
- Reward models demonstrate significant gains under best-of-N test-time scaling, improving downstream video understanding systems
- Research addresses a critical infrastructure gap in multimodal AI, where video understanding lagged behind text and image domains