AI | Bullish | Importance 7/10

AdaCodec: A Predictive Visual Code for Video MLLMs

arXiv – CS AI | Haowen Hou, Zhen Huang, Zheming Liang, Qingyi Si, Chenglin Li, Shuai Dong, Kele Shao, Ruilin Li, Dianyi Wang, Nan Duan, Jiaqi Wang | June 2, 2026 at 04:00 AM

AI Summary

AdaCodec introduces a predictive visual coding approach for video multimodal large language models (MLLMs) that allocates visual tokens adaptively, based on how predictable each scene is. Rather than encoding every frame independently as an RGB image, the system sends a full reference frame only when the scene is unpredictable and represents inter-frame changes with compact tokens. On long-video benchmarks it matches the per-frame baseline with one-seventh of the visual tokens, and time-to-first-token falls from 9.26 seconds to 1.62 seconds.

Analysis

AdaCodec addresses a fundamental inefficiency in how video MLLMs process temporal data. Conventional pipelines treat each sampled frame as an independent image, which creates substantial redundancy because adjacent frames typically share objects, backgrounds, and spatial layouts. AdaCodec exploits this temporal structure through predictive coding: full visual tokens are reserved for high-uncertainty scenes, while inter-frame deltas (including motion vectors and prediction residuals) are compressed into a small number of P-tokens.
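The allocation idea described above can be illustrated with a minimal sketch. It is not the paper's method: the predictability test here is a plain mean absolute pixel difference against the last reference frame, and the names `plan_tokens`, `token_cost`, `FULL_TOKENS`, and `P_TOKENS`, along with the token counts and the threshold, are hypothetical values chosen only for illustration.

```python
from typing import List, Optional, Sequence

Frame = Sequence[float]  # flattened grayscale pixels in [0, 1]

# Hypothetical token counts, NOT taken from the paper.
FULL_TOKENS = 196  # tokens spent on a full reference frame
P_TOKENS = 16      # tokens spent on a compact inter-frame delta


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Crude stand-in for prediction error: mean absolute pixel difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def plan_tokens(frames: List[Frame], threshold: float = 0.2) -> List[str]:
    """Label each frame 'I' (full reference) or 'P' (compact delta).

    The first frame is always a reference. A later frame becomes a new
    reference only when it differs from the current reference by more
    than `threshold` (an assumed, illustrative criterion).
    """
    plan: List[str] = []
    reference: Optional[Frame] = None
    for frame in frames:
        if reference is None or mean_abs_diff(frame, reference) > threshold:
            plan.append("I")
            reference = frame
        else:
            plan.append("P")
    return plan


def token_cost(plan: List[str]) -> int:
    """Total visual tokens for a plan under the assumed per-frame costs."""
    return sum(FULL_TOKENS if kind == "I" else P_TOKENS for kind in plan)


# Three near-identical frames, then a scene cut, then one more similar frame.
static = [0.1] * 4
cut = [0.9] * 4
frames = [static, static, static, cut, cut]

plan = plan_tokens(frames)
print(plan)                        # ['I', 'P', 'P', 'I', 'P']
print(token_cost(plan))            # 440
print(FULL_TOKENS * len(frames))   # 980 if every frame were encoded in full
```

In this toy clip only the first frame and the scene cut pay the full cost, so the budget tracks how much the content changes rather than how many frames are sampled.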

The work reflects a broader trend in efficient model design, where researchers aim to reduce computational overhead without sacrificing capability. Video understanding has become critical for modern MLLMs, yet transformer token budgets make naive frame-by-frame encoding impractical for long-form content. AdaCodec's conditional approach allocates compute according to information content rather than frame position.

The empirical results have direct implications for deployment and accessibility. Matching baseline performance with one-seventh of the visual tokens reduces both compute and latency: time-to-first-token drops from 9.26 seconds to 1.62 seconds, roughly a 5.7x speedup. That gain matters most for real-time applications and edge deployment. Developers working on video understanding gain a more efficient alternative to current frame-sampling strategies, which could enable longer-context video analysis without a proportional increase in cost.
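The reported figures can be turned into ratios with a few lines of arithmetic. The two latency values and the one-seventh token fraction come from the summary above; the 7,000-token baseline budget is a hypothetical number used only to make the fraction concrete.

```python
# Reported time-to-first-token values (seconds), from the summary above.
baseline_ttft_s = 9.26
adacodec_ttft_s = 1.62

speedup = baseline_ttft_s / adacodec_ttft_s
latency_cut = 1 - adacodec_ttft_s / baseline_ttft_s

# Hypothetical baseline budget, chosen only to illustrate the 1/7 fraction.
hypothetical_baseline_tokens = 7000
reduced_tokens = hypothetical_baseline_tokens // 7

print(f"speedup: {speedup:.2f}x")               # speedup: 5.72x
print(f"latency reduction: {latency_cut:.1%}")  # latency reduction: 82.5%
print(f"tokens: {hypothetical_baseline_tokens} -> {reduced_tokens}")  # tokens: 7000 -> 1000
```

The token reduction (about 7x) and the latency reduction (about 5.7x) are close but not equal. The summary does not explain the gap, so these ratios should be read as descriptive only.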

Future work will likely examine how similar predictive coding principles apply to other modalities and whether adaptive token allocation becomes standard practice in multimodal architectures. The approach also suggests opportunities for specialized hardware optimization around prediction residual encoding.

Key Takeaways

→ AdaCodec allocates tokens adaptively, sending full frames only for unpredictable scenes and compact tokens for inter-frame changes
→ The system matches baseline performance with only one-seventh of the visual tokens on long-video benchmarks
→ Time-to-first-token latency falls from 9.26 seconds to 1.62 seconds, which matters for real-time applications
→ AdaCodec improves on per-frame RGB baselines across all eleven tested benchmarks
→ Predictive visual coding moves video MLLMs toward content-aware compute allocation