Robust Zero-Shot Generalization for Open-Vocabulary Action Recognition via Task Arithmetic
Researchers propose a novel approach to Open-Vocabulary Action Recognition (OVAR) using task arithmetic and model merging, enabling zero-shot generalization to novel actions without costly domain-specific fine-tuning. By combining task vectors from models trained on diverse public datasets, the method achieves better out-of-distribution performance than a generic pretrained model while avoiding the privacy and regulatory concerns associated with target-domain training.
This research addresses a fundamental challenge in computer vision: recognizing actions outside a predefined set of classes without expensive retraining on the target domain. Traditional OVAR systems rely on vision-language models, but their accuracy typically degrades under the distribution shifts of real-world deployments. The proposed approach leverages task arithmetic, a technique that captures what a model learned during fine-tuning as a task vector (the difference between its fine-tuned and pretrained weights) and recombines such vectors, to build a more robust merged model without accessing target-domain data.
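To make the idea concrete, here is a minimal sketch of task arithmetic in its commonly described form: subtract the pretrained weights from each fine-tuned checkpoint to get a task vector, then add a scaled sum of those vectors back onto the pretrained weights. This is an illustration under stated assumptions, not the paper's exact merging recipe. The parameter name `proj`, the toy weight values, and the `scale` coefficient are all hypothetical.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def task_vector(pretrained: Weights, finetuned: Weights) -> Weights:
    """Task vector: element-wise difference (fine-tuned minus pretrained)."""
    return {
        name: [f - p for f, p in zip(finetuned[name], pretrained[name])]
        for name in pretrained
    }


def merge(pretrained: Weights, vectors: List[Weights], scale: float) -> Weights:
    """Add the scaled sum of task vectors back onto the pretrained weights."""
    merged = {}
    for name, base in pretrained.items():
        summed = [sum(v[name][i] for v in vectors) for i in range(len(base))]
        merged[name] = [b + scale * s for b, s in zip(base, summed)]
    return merged


# Toy weights standing in for a shared pretrained backbone and two
# checkpoints fine-tuned on different public datasets (all hypothetical).
pretrained = {"proj": [1.0, 2.0, 3.0]}
finetuned_a = {"proj": [1.5, 2.0, 2.0]}
finetuned_b = {"proj": [1.0, 3.0, 3.5]}

vectors = [task_vector(pretrained, ft) for ft in (finetuned_a, finetuned_b)]
merged = merge(pretrained, vectors, scale=0.5)
print(merged)
```

Only the fine-tuned checkpoints and the shared pretrained weights are needed, so no target-domain video is touched at merge time. In practice the scaling coefficient is a tunable choice, and the value used here is arbitrary.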
The significance lies in its practical implications. Real-world video analysis applications in security, autonomous systems, and content moderation frequently encounter action types that were absent from the training data. Requiring domain-specific fine-tuning adds computational overhead, raises data privacy issues, and can create regulatory complications under frameworks like GDPR. This work sidesteps those constraints by operating purely in a zero-shot paradigm.
The results indicate that strategically recombining knowledge from diverse public sources outperforms relying on a generic pretrained model. The finding has applications beyond action recognition: model merging is increasingly relevant as organizations seek to combine specialized models without full retraining. For developers and researchers, it offers a cost-effective path to deployment that maintains privacy compliance.
Looking forward, the validation of task arithmetic in OVAR opens questions about scaling to larger model ensembles and other vision-language tasks. The availability of code facilitates community adoption and extension, potentially establishing this as a standard approach for zero-shot generalization challenges across multimodal AI systems.
- Task arithmetic enables merging of diverse OVAR models to achieve superior zero-shot generalization without target-domain training.
- The approach eliminates privacy and regulatory concerns associated with domain-specific fine-tuning on sensitive video data.
- Knowledge recombination from public datasets outperforms single pretrained models in out-of-distribution action recognition scenarios.
- Model merging techniques offer practical cost savings by avoiding expensive retraining cycles in production systems.
- Open-source code release accelerates adoption of this paradigm for zero-shot generalization across vision-language tasks.