MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention
Researchers introduce MOSS-Video-Preview, a cross-attention architecture enabling real-time video understanding where models process frames continuously and revise answers as new information arrives. The approach achieves 5x speedup in time-to-first-token and 2.7x higher decoding throughput compared to decoder-only models, while maintaining competitive offline performance.