AIBullisharXiv – CS AI · 10h ago7/10
🧠
HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding
Researchers introduce HY-Himmel, a hierarchical video-language framework that efficiently processes long videos by separating semantic and motion encoding tasks. The system uses sparse keyframes for visual grounding while a lightweight adapter extracts motion information from compressed video data, achieving better performance than dense-frame baselines while reducing token usage by 3.6x.