ViTok-v2: Scaling Native Resolution Auto-Encoders to 5 Billion Parameters
Researchers introduce ViTok-v2, a 5-billion-parameter Vision Transformer autoencoder that supports native-resolution inputs and scales stably without adversarial losses. The work advances image tokenization for generative AI by improving reconstruction quality across multiple resolutions while preserving generation quality.
ViTok-v2 represents a significant advance in image tokenization, addressing limitations that have constrained Vision Transformer autoencoders. The core innovation is replacing the unstable GAN and LPIPS objectives with a perceptual loss computed on DINOv3 features, removing the training instability that previously capped model size. This change lets the system scale to 5 billion parameters, the largest image autoencoder documented to date, while maintaining reconstruction fidelity across variable resolutions and aspect ratios via NaFlex.
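To make the loss substitution concrete, the sketch below shows one way a feature-space objective of this kind could be wired up in PyTorch. The `frozen_encoder` callable stands in for a DINOv3-style backbone that returns patch features; the loss form, weights, and any preprocessing are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def perceptual_loss(frozen_encoder, recon, target):
    """Feature-space reconstruction loss using a frozen self-supervised ViT.

    `frozen_encoder` is a placeholder for a DINOv3-style backbone mapping a
    batch of images to patch features of shape (B, N, D); the exact model and
    feature extraction point are assumptions, not the authors' released code.
    """
    with torch.no_grad():
        target_feats = frozen_encoder(target)   # reference features, no gradients
    recon_feats = frozen_encoder(recon)         # gradients flow back into the decoder
    # Distance in feature space stands in for the LPIPS/GAN perceptual terms.
    return F.mse_loss(recon_feats, target_feats)

def autoencoder_loss(frozen_encoder, recon, target, w_pixel=1.0, w_perc=1.0):
    """Combine a plain pixel loss with the perceptual term; weights are illustrative."""
    pixel = F.l1_loss(recon, target)
    perc = perceptual_loss(frozen_encoder, recon, target)
    return w_pixel * pixel + w_perc * perc
```

Because the perceptual target comes from a fixed pretrained encoder rather than a jointly trained discriminator, there is no adversarial min-max dynamic to destabilize training, which is the property the paper credits for stable scaling.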
The research builds on earlier ViTok work, which identified a compression-ratio trade-off between reconstruction quality and generation difficulty. By scaling the autoencoder and paired flow-matching generators jointly, the team shows that improving tokenizer reconstruction meaningfully advances the Pareto frontier of this trade-off. Training on approximately 2 billion images provides substantial empirical grounding for the approach.
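As a rough illustration of what training a paired flow-matching generator on tokenizer latents involves, the sketch below implements a standard rectified-flow objective. The `encoder` and `generator` interfaces are assumed for illustration only and are not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(generator, encoder, images, cond=None):
    """One training step of a flow-matching generator on autoencoder latents.

    Assumed interfaces: `encoder(images)` returns latents, and
    `generator(x_t, t, cond)` predicts a velocity field.
    """
    with torch.no_grad():
        z1 = encoder(images)                  # latents from the tokenizer (treated as fixed here)
    z0 = torch.randn_like(z1)                 # Gaussian noise endpoint
    t = torch.rand(z1.shape[0], device=z1.device)
    t_b = t.view(-1, *([1] * (z1.dim() - 1)))  # broadcast t over latent dims
    zt = (1 - t_b) * z0 + t_b * z1            # linear interpolation between noise and data
    v_target = z1 - z0                        # constant velocity along the straight path
    v_pred = generator(zt, t, cond)
    return F.mse_loss(v_pred, v_target)
```

Under this setup, a better tokenizer yields latents that are easier to reconstruct from, shifting the reconstruction-generation trade-off rather than only one side of it.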
For the AI and generative modeling ecosystem, this work carries implications for multimodal model development and foundation model scaling. Superior image tokenization directly impacts downstream tasks in video generation, image synthesis, and vision-language models. The removal of adversarial training components simplifies the development pipeline and reduces training complexity for researchers implementing similar architectures.
The performance metrics—matching state-of-the-art at 256p resolution while outperforming baselines at 512p and higher—suggest growing practical utility as generative AI applications demand higher-resolution outputs. Future developments will likely focus on optimizing inference efficiency and exploring integration with larger language models.
- ViTok-v2 scales to 5B parameters, the largest image autoencoder to date, without using adversarial losses
- Native resolution support via NaFlex enables generalization across multiple resolutions and aspect ratios
- A DINOv3 perceptual loss replaces both LPIPS and GAN objectives, improving training stability at scale
- Joint scaling of autoencoders and generators advances the reconstruction-generation quality trade-off frontier
- Stronger tokenization at 512p and above improves utility for high-quality generative applications