AI · Neutral · arXiv – CS AI · 7h ago · 5/10
ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization
ConTrans is a neural network architecture for zero-shot temporal action localization. It combines convolutional and transformer layers to capture both local dependencies between neighboring frames and long-range video context. The authors report improved results on standard benchmark datasets and present the design as a response to existing methods that underuse local correlations between frames.
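To make the local-plus-global idea concrete, here is a minimal sketch in plain Python of a block that first mixes each frame with its neighbors (a temporal convolution) and then lets every frame attend to all others (self-attention). It is not the authors' implementation: the function names, the fixed smoothing kernel, the absence of learned weights, and the residual sum are all assumptions made for illustration, and the text-enhancement part named in the title is left out.

```python
import math


def temporal_conv(frames, kernel):
    """Local mixing: weighted sum over a window of neighboring frames (zero padded)."""
    half = len(kernel) // 2
    num_frames, dim = len(frames), len(frames[0])
    out = []
    for t in range(num_frames):
        row = [0.0] * dim
        for k, weight in enumerate(kernel):
            src = t + k - half
            if 0 <= src < num_frames:  # frames outside the clip count as zeros
                for d in range(dim):
                    row[d] += weight * frames[src][d]
        out.append(row)
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Global mixing: each frame becomes a weighted average of all frames.

    Assumption: queries, keys, and values are the raw features; a real
    transformer layer would apply learned projections first.
    """
    dim = len(frames[0])
    scale = math.sqrt(dim)
    out = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        weights = softmax(scores)
        out.append([sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)])
    return out


def local_global_block(frames, kernel=(0.25, 0.5, 0.25)):
    """Hypothetical block: convolution for local context, attention for global context."""
    local = temporal_conv(frames, kernel)
    global_context = self_attention(local)
    # Residual sum keeps the local features alongside the global summary.
    return [[a + b for a, b in zip(l_row, g_row)]
            for l_row, g_row in zip(local, global_context)]


if __name__ == "__main__":
    # Four frames with two-dimensional features (made-up values).
    clip = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    for row in local_global_block(clip):
        print([round(x, 3) for x in row])
```

The output has the same shape as the input (one feature vector per frame), so blocks of this kind can be stacked. Running the convolution before attention is one possible ordering; the summary does not say how ConTrans arranges its layers.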