DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding
Researchers introduce DynFrame, a video understanding framework that lets multimodal language models dynamically select both the temporal window and the frame sampling rate during inference. Its 4B model is competitive with larger 7B-8B baselines, and its 8B variant sets new state-of-the-art results across six video understanding benchmarks.
DynFrame addresses inefficiencies in how video multimodal large language models (MLLMs) retrieve and process visual information. Traditional approaches retrieve frames at a fixed frame rate, so capturing fine-grained evidence takes repeated retrieval calls, which bloats the inference context. By tokenizing both the temporal window and the sampling density as learnable parameters, DynFrame retrieves multi-granularity evidence in a single step, substantially reducing computational overhead and model complexity.
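To make the idea of a joint window-and-density request concrete, here is a minimal Python sketch of how one retrieval decision could be turned into frame timestamps. It is an illustration under stated assumptions, not DynFrame's implementation: the `RetrievalRequest` type, the `frame_timestamps` helper, the use of seconds and frames per second as units, and the `max_frames` cap are all hypothetical names and choices that do not come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalRequest:
    """Hypothetical container for one retrieval decision (not from the source)."""
    start_s: float  # start of the temporal window, in seconds
    end_s: float    # end of the temporal window, in seconds
    fps: float      # sampling density chosen for this window


def frame_timestamps(req: RetrievalRequest, max_frames: int = 64) -> list[float]:
    """Return evenly spaced timestamps inside the requested window.

    The frame count follows from window length times sampling density and is
    capped by max_frames (an assumed budget) to keep the context bounded.
    """
    if req.end_s <= req.start_s or req.fps <= 0:
        raise ValueError("window must be non-empty and fps must be positive")
    duration = req.end_s - req.start_s
    n = max(1, min(max_frames, int(duration * req.fps)))
    step = duration / n
    # Sample at the centre of each of the n equal sub-intervals.
    return [req.start_s + (i + 0.5) * step for i in range(n)]


# One coarse request and one dense request, each resolved in a single call.
coarse = frame_timestamps(RetrievalRequest(start_s=10.0, end_s=20.0, fps=0.5))
dense = frame_timestamps(RetrievalRequest(start_s=10.0, end_s=12.0, fps=4.0))
```

With a fixed frame rate, the dense two-second window would need a separate follow-up call; here the density is simply part of the request, which is the property the paragraph above describes.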
The framework's contribution goes beyond efficiency to credit assignment during training. Existing video MLLMs optimize retrieval and answer generation together, treating all tokens equally regardless of their role. Segment-Decoupled GRPO (SD-GRPO) decouples this optimization by splitting rollouts at retrieval boundaries and assigning role-specific advantages, so that sampling decisions and reasoning quality are credited separately. This fine-grained optimization improves both retrieval accuracy and answer generation.
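The splitting of rollouts and role-specific advantages can be sketched roughly as follows. This is a simplified illustration, not the SD-GRPO algorithm as published: the source does not give the reward definitions or the exact advantage formula, so the sketch assumes each rollout carries one scalar retrieval reward and one scalar answer reward, normalizes each within the group in the usual GRPO style (reward minus group mean, divided by group standard deviation), and broadcasts the result to the tokens on the matching side of a single retrieval boundary. The `Rollout` type and all field names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Rollout:
    """Hypothetical rollout record (field names are assumptions)."""
    num_tokens: int          # total generated tokens
    boundary: int            # tokens [0, boundary) belong to the retrieval segment
    retrieval_reward: float  # assumed scalar reward for the retrieval decision
    answer_reward: float     # assumed scalar reward for the final answer


def group_normalize(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style group-relative advantage: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segment_decoupled_advantages(group: list[Rollout]) -> list[list[float]]:
    """Assign each token the advantage of the segment it belongs to."""
    adv_retrieval = group_normalize([r.retrieval_reward for r in group])
    adv_answer = group_normalize([r.answer_reward for r in group])
    per_token = []
    for i, r in enumerate(group):
        retrieval_part = [adv_retrieval[i]] * r.boundary
        answer_part = [adv_answer[i]] * (r.num_tokens - r.boundary)
        per_token.append(retrieval_part + answer_part)
    return per_token


# Rollout 0 retrieved well but answered badly; rollout 1 did the opposite.
group = [
    Rollout(num_tokens=6, boundary=2, retrieval_reward=1.0, answer_reward=0.0),
    Rollout(num_tokens=5, boundary=3, retrieval_reward=0.0, answer_reward=1.0),
]
advantages = segment_decoupled_advantages(group)
```

In this toy group, a single shared advantage would give both rollouts the same total reward and therefore no learning signal, whereas the decoupled version rewards the good retrieval tokens of the first rollout and the good answer tokens of the second.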
DynFrame's results suggest that architectural innovation can narrow the performance gap between model sizes. The 4B-parameter version matches the capabilities of 7B-8B models, suggesting that parameter efficiency through smarter design may become increasingly valuable as video understanding demands grow. The 8B variant's state-of-the-art performance across NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, and LVBench indicates robust generalization across different video understanding tasks.
The open-source release positions DynFrame as a foundational contribution that could accelerate development of more efficient multimodal systems. As video data becomes central to AI applications, techniques that reduce computational requirements while maintaining quality will drive broader adoption.
- DynFrame enables learnable frame sampling density alongside temporal window selection within a single autoregressive retrieval pass, reducing inference overhead.
- Segment-Decoupled GRPO assigns separate token-level advantages to retrieval and reasoning decisions, improving credit assignment in training.
- DynFrame-4B matches larger 7B-8B baselines across multiple benchmarks, demonstrating parameter efficiency through architectural innovation.
- The framework achieves state-of-the-art results on six video understanding benchmarks including NExT-GQA, ActivityNet-MR, and Video-MME.
- Open-source code availability enables rapid adoption and further development of efficient multimodal video understanding systems.