Variable-Length Tokenization via Learnable Global Merging for Diffusion Transformers
Researchers propose a novel variable-length tokenizer using learnable global merging to improve the quality-compute trade-off in latent diffusion models. Unlike conventional truncation-based approaches, the merging method maintains representational alignment across different compression levels, enabling diffusion transformers to operate more effectively with adaptive token counts.
The research addresses a fundamental constraint in visual synthesis: a latent diffusion model built on a fixed compression ratio must either sacrifice quality for speed or spend excessive compute for high fidelity. Variable-length tokenizers in principle enable adaptive compression, but existing approaches share a critical problem: truncating a token sequence makes each token's semantics depend on its position, and the resulting distribution shift between lengths prevents a single model from handling multiple lengths effectively.
The paper's core innovation is replacing truncation with token merging, where similar tokens are combined rather than discarded. Because the merging is learnable and global, it is data-independent: the merging pattern stays fixed and predictable during generation instead of varying with the input. This design keeps token representations aligned across compression levels, so a diffusion transformer can maintain stable performance whether it operates on many tokens or few. On ImageNet 256×256 generation, the method shows a better trade-off between generative quality (gFID) and computational cost than previous variable-length tokenizers. The code is open source, which may ease adoption.
This work matters for the broader AI infrastructure sector because efficient visual synthesis directly affects applications ranging from real-time content creation to deployment in resource-constrained environments. The ability to balance quality and compute on a per-generation basis could make diffusion models more practical to deploy in production systems. Developers working with visual generation pipelines should watch whether this merging approach becomes standard practice across popular diffusion model implementations.
- Learnable global merging preserves token semantics across variable compression levels by combining similar tokens instead of truncating sequences.
- The method achieves superior quality-compute trade-offs on ImageNet 256×256 generation compared to prior variable-length tokenizer methods.
- Data-independent merging patterns ensure consistency during generation, enabling stable diffusion transformer operation across different token counts.
- Open-source code availability may ease adoption in visual synthesis applications and production systems.
- The approach addresses a fundamental constraint in latent diffusion models, which previously had to commit to a fixed compression ratio favoring either quality or compute.
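
The contrast between truncation and data-independent merging can be illustrated with a small sketch. This is not the paper's implementation: the summary does not describe the actual merging rule, so the code assumes one simple possibility, a fixed table of logits per target length that is turned into soft assignment weights. The names `GlobalMerger`, `truncate`, and `softmax`, and the random logits standing in for learned parameters, are all hypothetical.

```python
import math
import random


def truncate(tokens, k):
    """Truncation baseline: keep the first k tokens and drop the rest."""
    return tokens[:k]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


class GlobalMerger:
    """Hypothetical data-independent merger (an assumption, not the paper's code).

    One logit table per target length. The logits stand in for learned
    parameters and never look at the input tokens, so the merging pattern
    is identical for every image.
    """

    def __init__(self, num_tokens, target_lengths, seed=0):
        rng = random.Random(seed)
        self.num_tokens = num_tokens
        self.logits = {
            k: [[rng.gauss(0.0, 1.0) for _ in range(num_tokens)] for _ in range(k)]
            for k in target_lengths
        }

    def weights(self, k):
        # Each of the k output tokens is a convex combination of all inputs.
        return [softmax(row) for row in self.logits[k]]

    def merge(self, tokens, k):
        if len(tokens) != self.num_tokens:
            raise ValueError("unexpected number of input tokens")
        w = self.weights(k)
        dim = len(tokens[0])
        return [
            [sum(w[i][j] * tokens[j][d] for j in range(self.num_tokens))
             for d in range(dim)]
            for i in range(k)
        ]


# Toy input: 8 tokens of dimension 4.
rng = random.Random(1)
tokens = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(8)]

merger = GlobalMerger(num_tokens=8, target_lengths=[2, 4])
merged_4 = merger.merge(tokens, 4)   # every input token contributes
cut_4 = truncate(tokens, 4)          # tokens 4..7 are simply lost
print(len(merged_4), len(cut_4))
```

In this sketch every input token contributes to every shortened sequence, whereas truncation discards the tail outright. Because `weights(k)` depends only on the target length, a downstream model sees the same mixing pattern for every sample, which is the property the summary attributes to data-independent merging.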