AI · Neutral · arXiv – CS AI · 3h ago · 5/10
Mining Multi-Modality Spatio-Temporal Cues for Video Important Person Identification
Researchers introduce the Video Important Person (VIP) identification task and the Temporal-VIP dataset for automatically identifying key individuals in video scenes, with attention to the Temporal Importance Shift phenomenon. Their VIP-Net framework reaches 67.3% accuracy, compared with 37.5% to 53.9% for existing methods. Proposed applications include automated video editing and intelligent surveillance.
🏢 Hugging Face