y0news
🧠 AI | ⚪ Neutral | Importance 6/10

DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

arXiv – CS AI | Peng Zhang, Guanghao Zhang, Wanggui He, Longxiang Zhang, Mushui Liu, Yan Xia, Zhenhao Peng, Weilong Dai, Jinlong Liu, Haobing Tang, Le Zhang, Hao Jiang, Pipei Huang
🤖 AI Summary

Researchers introduce DynFrame, an advanced video understanding framework that enables multimodal language models to dynamically select both temporal windows and frame sampling rates during inference. The approach achieves competitive performance with smaller 4B models against larger 7B-8B baselines and sets new state-of-the-art results with its 8B variant across six video understanding benchmarks.

Analysis

DynFrame addresses an inefficiency in how video multimodal large language models (MLLMs) retrieve and process visual information. Existing approaches retrieve frames at a fixed frame rate, so a model must issue repeated retrieval calls to capture fine-grained evidence, and each call lengthens the inference context. DynFrame instead has the model emit both the temporal window and the sampling density as tokens it learns to generate, so a single retrieval step can return evidence at multiple granularities, substantially reducing computational overhead.
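As a rough illustration of what choosing both a window and a sampling density in one step could mean, the sketch below expands a single retrieval request into frame timestamps. It is not the paper's implementation: `RetrievalRequest`, `frame_timestamps`, and the `max_frames` cap are hypothetical names and parameters introduced only for this example, and the summary does not describe the actual token format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RetrievalRequest:
    """Hypothetical decoded form of one retrieval decision."""
    start_s: float  # window start, in seconds
    end_s: float    # window end, in seconds
    fps: float      # sampling density chosen by the model


def frame_timestamps(req: RetrievalRequest, max_frames: int = 64) -> List[float]:
    """Expand a (window, density) request into evenly spaced timestamps.

    The frame count is capped at max_frames (an assumed budget); when the
    cap applies, the frames are spread evenly over the whole window.
    """
    if req.end_s <= req.start_s or req.fps <= 0:
        raise ValueError("window must be non-empty and fps must be positive")
    duration = req.end_s - req.start_s
    n = max(1, min(max_frames, round(duration * req.fps)))
    step = duration / n
    return [req.start_s + i * step for i in range(n)]


# A dense look at a short window and a coarse scan of a long one,
# each expressed as a single request.
dense = frame_timestamps(RetrievalRequest(10.0, 14.0, 2.0))
coarse = frame_timestamps(RetrievalRequest(0.0, 60.0, 0.5), max_frames=16)
```

Under this reading, a fixed-rate retriever would need several calls to cover both cases, whereas one request per granularity is enough here.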

The framework's innovation extends beyond technical efficiency to credit assignment mechanisms. Existing video MLLMs optimize retrieval and answer generation together, treating all tokens equally regardless of their role. Segment-Decoupled GRPO (SD-GRPO) decouples this optimization by splitting rollouts at retrieval boundaries and assigning role-specific advantages, allowing the model to separately credit sampling decisions versus reasoning quality. This fine-grained optimization improves both retrieval accuracy and answer generation.
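The idea of splitting a rollout at the retrieval boundary can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' SD-GRPO: it assumes each rollout carries one scalar retrieval reward and one scalar answer reward, normalizes each within the group in the usual GRPO style, and broadcasts the result to the tokens of the matching segment. The dictionary keys and function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_normalize(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantage: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segment_decoupled_advantages(rollouts: List[Dict]) -> List[List[float]]:
    """Assign per-token advantages, split at each rollout's retrieval boundary.

    Each rollout dict is assumed to hold:
      num_tokens        total generated tokens
      boundary          index of the first token after the retrieval segment
      retrieval_reward  scalar score for the sampling decision (assumed)
      answer_reward     scalar score for the final answer (assumed)
    """
    retrieval_adv = group_normalize([r["retrieval_reward"] for r in rollouts])
    answer_adv = group_normalize([r["answer_reward"] for r in rollouts])
    per_token = []
    for r, a_ret, a_ans in zip(rollouts, retrieval_adv, answer_adv):
        b = r["boundary"]
        # Retrieval tokens are credited only for retrieval quality,
        # reasoning and answer tokens only for answer quality.
        per_token.append([a_ret] * b + [a_ans] * (r["num_tokens"] - b))
    return per_token


group = [
    {"num_tokens": 6, "boundary": 2, "retrieval_reward": 1.0, "answer_reward": 0.0},
    {"num_tokens": 5, "boundary": 3, "retrieval_reward": 0.0, "answer_reward": 1.0},
]
advantages = segment_decoupled_advantages(group)
```

In this toy group, the first rollout retrieved well but answered badly, so its retrieval tokens receive a positive advantage and its answer tokens a negative one; a single shared advantage would blur the two signals.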

DynFrame's results suggest that design choices can narrow the performance gap between model sizes. The 4B-parameter version is competitive with 7B-8B baselines, which indicates that parameter efficiency through smarter design may become increasingly valuable as video understanding demands grow. The 8B variant's state-of-the-art performance across NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, and LVBench points to robust generalization across different video understanding tasks.

The open-source release positions DynFrame as a foundational contribution that could accelerate development of more efficient multimodal systems. As video data becomes central to AI applications, techniques that reduce computational requirements while maintaining quality will drive broader adoption.

Key Takeaways
  • DynFrame enables learnable frame sampling density alongside temporal window selection within a single autoregressive retrieval pass, reducing inference overhead.
  • Segment-Decoupled GRPO assigns separate token-level advantages to retrieval and reasoning decisions, improving credit assignment in training.
  • DynFrame-4B matches larger 7B-8B baselines across multiple benchmarks, demonstrating parameter efficiency through architectural innovation.
  • The framework achieves state-of-the-art results on six video understanding benchmarks including NExT-GQA, ActivityNet-MR, and Video-MME.
  • Open-source code availability enables rapid adoption and further development of efficient multimodal video understanding systems.