AIBullisharXiv โ CS AI ยท 14h ago6/10
๐ง
BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning
Researchers introduce BoxTuning, a novel approach for improving video understanding in multimodal AI models by rendering object bounding boxes directly onto video frames as visual prompts rather than encoding them as text tokens. The method achieves 87-93% reduction in text token usage while maintaining full temporal resolution, demonstrating superior performance on video question-answering tasks.