#video-language-models News & Analysis

4 articles tagged with #video-language-models. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

Oracle-RLAIF: An Improved Fine-Tuning Framework for Multi-modal Video Models using Reinforcement Learning from Ranking Feedback

Researchers propose Oracle-RLAIF, a novel fine-tuning framework for video-language models that replaces expensive trained reward models with a general-purpose oracle ranker, paired with a new rank-based loss function (GRPO_rank). This approach significantly reduces the cost of gathering human feedback while improving performance across video comprehension benchmarks.
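As a rough illustration of how ranking feedback can stand in for scalar rewards, the sketch below turns an oracle's ranking of one group of sampled responses into group-normalized advantages. It is not the paper's GRPO_rank loss: the function name `rank_advantages` and the linear rank-to-score mapping are assumptions made for this example only.

```python
from statistics import mean, pstdev


def rank_advantages(ranks):
    """Map oracle ranks (1 = best) for one group of responses to
    zero-mean, unit-variance advantages.

    Hypothetical helper for illustration; not the GRPO_rank formulation.
    """
    n = len(ranks)
    # Assumed linear mapping: the best-ranked response gets the highest score.
    scores = [float(n - r) for r in ranks]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All responses tied: no preference signal in this group.
        return [0.0] * n
    return [(s - mu) / sigma for s in scores]


# Four sampled responses; the oracle ranks the second one best.
advantages = rank_advantages([2, 1, 4, 3])
```

Responses ranked above the group average receive positive advantages and the rest negative ones, so a policy-gradient update needs only the ordering and no trained reward model.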

AI · Bullish · arXiv – CS AI · Jun 8 · 7/10

MACD: Model-Aware Contrastive Decoding via Counterfactual Data

Researchers introduce MACD, an inference strategy that reduces hallucinations in video-language models by using the model's own feedback to identify problematic visual regions and generate targeted counterfactual data. The method combines model-aware object-level modifications with contrastive decoding, showing consistent improvements across multiple benchmarks and video-LLM architectures.
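For readers unfamiliar with contrastive decoding, here is a minimal sketch of the generic logit adjustment commonly used for it, assuming two forward passes: one on the original video and one on a counterfactual version. The helper names, the `alpha` weight, and the toy logits are illustrative assumptions, not details taken from MACD.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original, counterfactual, alpha=1.0):
    """Generic contrastive adjustment: (1 + alpha) * original - alpha * counterfactual.

    Tokens whose logits barely change when the visual evidence is altered
    are likely driven by language priors, so they gain nothing here.
    """
    if len(original) != len(counterfactual):
        raise ValueError("logit vectors must have the same length")
    return [(1 + alpha) * o - alpha * c for o, c in zip(original, counterfactual)]


# Toy 3-token vocabulary (assumed values). Token 0 depends on the visual
# evidence; token 1 scores the same with or without it.
original = [2.0, 1.9, 0.1]
counterfactual = [0.5, 1.9, 0.1]
adjusted = contrastive_logits(original, counterfactual, alpha=1.0)
probs = softmax(adjusted)
```

In this toy case the visually grounded token gains probability mass relative to the token that the counterfactual input supports just as strongly.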

AI · Bullish · arXiv – CS AI · May 11 · 6/10

TTF: Temporal Token Fusion for Efficient Video-Language Model

Researchers introduce Temporal Token Fusion (TTF), a training-free compression technique that reduces visual tokens in video-language models by 67% while maintaining 99.5% accuracy. The method addresses the critical bottleneck of LLM prefill costs in video understanding by identifying and fusing redundant tokens across video frames using local similarity matching.
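The idea of fusing redundant tokens across frames can be sketched as follows. This is a simplified illustration, not the TTF algorithm itself: the function `fuse_tokens`, the cosine-similarity `threshold`, the 1-D position `window`, and the running-mean fusion are all assumptions chosen to keep the example small.

```python
import math


def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def fuse_tokens(frames, threshold=0.9, window=1):
    """Fuse each token into a similar token from earlier frames, if one
    exists within `window` positions; otherwise keep it as a new token.

    frames: list of frames, each a list of token vectors (same count per frame).
    Returns a list of (frame_index, position, vector) for the kept tokens.
    """
    kept = [(0, p, list(v)) for p, v in enumerate(frames[0])]
    counts = [1] * len(kept)                  # tokens merged into each kept entry
    reference = list(range(len(frames[0])))   # kept index representing each position
    for t in range(1, len(frames)):
        new_reference = list(reference)
        for p, vec in enumerate(frames[t]):
            lo, hi = max(0, p - window), min(len(reference), p + window + 1)
            # Local matching: only nearby positions are candidates.
            best = max(range(lo, hi), key=lambda q: cosine(vec, kept[reference[q]][2]))
            k = reference[best]
            if cosine(vec, kept[k][2]) >= threshold:
                n = counts[k]
                merged = [(n * a + b) / (n + 1) for a, b in zip(kept[k][2], vec)]
                kept[k] = (kept[k][0], kept[k][1], merged)  # running mean
                counts[k] = n + 1
                new_reference[p] = k
            else:
                kept.append((t, p, list(vec)))
                counts.append(1)
                new_reference[p] = len(kept) - 1
        reference = new_reference
    return kept


# Three frames of two tokens each; only the second token changes, in frame 2.
frames = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 1.0]],
]
kept = fuse_tokens(frames)
```

Here six input tokens collapse to three kept tokens, since the static content is represented once and only the changed token is added. Because the procedure only compares and averages existing token vectors, it needs no training, which matches the training-free framing in the summary.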

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

QuickGrasp: Responsive Video-Language Querying Service via Accelerated Tokenization and Edge-Augmented Inference

Researchers propose QuickGrasp, a video-language querying system that combines local processing with edge computing to achieve both fast response times and high accuracy. The system achieves up to 12.8x reduction in response delay while maintaining the accuracy of large video-language models through accelerated tokenization and adaptive edge augmentation.