#video-qa News & Analysis

3 articles tagged with #video-qa. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

ProcessThinker introduces a novel post-training method for multimodal large language models that provides step-level process rewards without requiring explicit reward model training. By using rollout-based sampling to verify intermediate reasoning steps, the approach improves visual question answering across multiple benchmarks while reducing computational overhead compared to traditional process reward models.
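The summary does not spell out how the rollouts are scored, so the following is only a minimal sketch of one common way to estimate step-level rewards by sampling: from each reasoning prefix, sample several completions and use the fraction that reach the reference answer as that step's reward. The names `rollout_step_rewards`, `complete`, and `toy_complete` are hypothetical, and `toy_complete` is a stand-in for a real model call, not ProcessThinker's procedure.

```python
import random
from typing import Callable, List


def rollout_step_rewards(
    steps: List[str],
    complete: Callable[[List[str], random.Random], str],
    reference_answer: str,
    num_rollouts: int = 8,
    seed: int = 0,
) -> List[float]:
    """Score each step by the share of rollouts from its prefix that hit the reference answer."""
    rng = random.Random(seed)
    rewards = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]  # reasoning up to and including step i
        hits = sum(
            complete(prefix, rng) == reference_answer
            for _ in range(num_rollouts)
        )
        rewards.append(hits / num_rollouts)
    return rewards


def toy_complete(prefix: List[str], rng: random.Random) -> str:
    """Hypothetical stand-in for a model: any prefix containing 'mistake' ends in a wrong answer."""
    return "wrong" if any("mistake" in step for step in prefix) else "42"


if __name__ == "__main__":
    steps = ["read the question", "mistake: miscount objects", "state the answer"]
    print(rollout_step_rewards(steps, toy_complete, "42", num_rollouts=4))
```

In this toy setup the reward drops at the first faulty step and stays low afterwards, which is the signal a step-level reward is meant to provide without training a separate reward model.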

AI · Bullish · arXiv – CS AI · Apr 14 · 6/10

BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

Researchers introduce BoxTuning, an approach for improving video understanding in multimodal AI models that renders object bounding boxes directly onto video frames as visual prompts rather than encoding them as text tokens. The method achieves an 87-93% reduction in text token usage while maintaining full temporal resolution, and it reports superior performance on video question-answering tasks.
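To illustrate the idea of drawing boxes into the pixels instead of describing them in text, here is a minimal sketch that outlines boxes on grayscale frames stored as nested lists. The frame representation, the inclusive `(x0, y0, x1, y1)` box format, and the function names `draw_box` and `render_boxes` are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple

Frame = List[List[int]]          # grayscale frame: rows of pixel intensities
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive pixel coordinates


def draw_box(frame: Frame, box: Box, value: int = 255) -> Frame:
    """Return a copy of `frame` with the outline of `box` drawn at intensity `value`."""
    height, width = len(frame), len(frame[0])
    out = [row[:] for row in frame]  # copy so the input frame is not modified
    x0, y0, x1, y1 = box
    # Clamp the box to the frame; skip boxes that fall entirely outside it.
    x0, x1 = max(0, x0), min(width - 1, x1)
    y0, y1 = max(0, y0), min(height - 1, y1)
    if x0 > x1 or y0 > y1:
        return out
    for x in range(x0, x1 + 1):  # top and bottom edges
        out[y0][x] = value
        out[y1][x] = value
    for y in range(y0, y1 + 1):  # left and right edges
        out[y][x0] = value
        out[y][x1] = value
    return out


def render_boxes(frames: List[Frame], boxes_per_frame: List[List[Box]]) -> List[Frame]:
    """Draw every frame's boxes onto that frame, keeping all frames in order."""
    rendered = []
    for frame, boxes in zip(frames, boxes_per_frame):
        for box in boxes:
            frame = draw_box(frame, box)
        rendered.append(frame)
    return rendered
```

Because the boxes live in the image, the text prompt no longer needs to carry coordinates for every object in every frame, which is where a saving in text tokens would come from.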

AI · Bullish · arXiv – CS AI · Mar 17 · 5/10

Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering

Researchers developed a question-aware keyframe selection framework for video question answering that is trained with pseudo labels generated by large multimodal models, together with coverage regularization. The method significantly improves accuracy on temporal and causal questions in the NExT-QA dataset, and selecting fewer frames reduces inference costs.
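As a rough illustration of selecting frames by question relevance while still covering the whole clip, the sketch below splits a video into `k` equal temporal segments and keeps the highest-scoring frame in each. This is a hand-written stand-in, not the paper's learned selector: the per-frame `scores` are assumed to come from some question-conditioned model, and `select_keyframes` is a hypothetical name.

```python
from typing import List


def select_keyframes(scores: List[float], k: int) -> List[int]:
    """Pick the most relevant frame index from each of k equal temporal segments."""
    n = len(scores)
    k = min(k, n)  # never ask for more keyframes than there are frames
    selected = []
    for segment in range(k):
        start = segment * n // k
        end = (segment + 1) * n // k
        # Highest question-relevance score within this segment.
        best = max(range(start, end), key=lambda i: scores[i])
        selected.append(best)
    return selected


if __name__ == "__main__":
    relevance = [0.1, 0.9, 0.2, 0.3, 0.8, 0.1]
    print(select_keyframes(relevance, k=2))
```

Forcing one frame per segment is a simple way to keep early and late events represented, which matters for temporal and causal questions; only the selected frames would then be passed to the answering model.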