Researchers propose CAT (Cross-scale Aligned Transformer), a new GAN training method that addresses the cross-scale trajectory misalignment problem in multi-stage image generation. By adding consistency regularization between intermediate and final outputs, CAT achieves state-of-the-art results on ImageNet-256 with one-step inference, reaching FID-50K of 1.56 after just 60 training epochs.
The paper identifies a flaw in how modern multi-scale GAN generators are trained. Traditional approaches apply independent adversarial supervision at each resolution level, treating each scale as a separate optimization target. As a result, an intermediate output can look realistic at its own scale while depicting a different sample than the final output, which breaks the coarse-to-fine generation that practitioners assume they are implementing.
This cross-scale trajectory misalignment is a limitation that earlier work overlooked. Prior work on progressive training and hierarchical synthesis assumed that scale-wise realism naturally enforces consistency across stages, but the researchers demonstrate that this assumption does not hold. The solution, generator-side consistency regularization, is simple: scale-wise discriminator evaluation stays in place, and alignment is enforced through the generator network itself.
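To make the idea concrete, here is a minimal NumPy sketch of what a generator-side consistency term could look like. The summary does not give CAT's actual loss, so everything here is an assumption for illustration: the function names (`downsample`, `consistency_loss`, `generator_loss`), the mean-squared-error distance, the average-pooling downsampler, and the `weight` parameter are all hypothetical.

```python
import numpy as np

def downsample(img, factor):
    """Average-pool an (H, W, C) image by an integer factor (assumed resampler)."""
    h, w, c = img.shape
    return img.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def consistency_loss(intermediates, final):
    """Mean squared error between each intermediate output and the final
    output downsampled to the same resolution. The MSE distance is an
    illustrative assumption, not the paper's stated formulation."""
    total = 0.0
    for x in intermediates:
        factor = final.shape[0] // x.shape[0]
        total += np.mean((x - downsample(final, factor)) ** 2)
    return total / len(intermediates)

def generator_loss(adv_losses, intermediates, final, weight=1.0):
    """Per-scale adversarial terms (computed elsewhere, one per discriminator
    scale) plus the weighted consistency term added on the generator side."""
    return sum(adv_losses) + weight * consistency_loss(intermediates, final)

# Toy example: an 8x8 "final" image with 2x2 and 4x4 intermediate outputs.
rng = np.random.default_rng(0)
final = rng.normal(size=(8, 8, 3))
aligned = [downsample(final, 4), downsample(final, 2)]        # same sample, coarser
misaligned = [rng.normal(size=(2, 2, 3)), rng.normal(size=(4, 4, 3))]  # unrelated samples

print(consistency_loss(aligned, final))
print(consistency_loss(misaligned, final))
```

In this toy setup the aligned intermediates incur no consistency penalty, while the unrelated ones do, even though either set could in principle satisfy a per-scale discriminator. That is the gap the regularizer is meant to close.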
The practical implications are substantial. Achieving FID-50K of 1.56 on ImageNet-256 with single-step inference after only 60 training epochs represents meaningful progress in efficient image generation. This matters for deployment scenarios where inference speed and training efficiency directly affect production viability. The one-step capability in particular challenges the recent dominance of diffusion models, which typically require dozens of sampling steps.
Looking forward, the methodology raises questions about whether similar trajectory misalignment affects other hierarchical generative models beyond GANs. The consistency regularization approach could potentially transfer to other architectures. Further investigation into whether CAT scales efficiently to higher resolutions and how it performs on diverse datasets beyond ImageNet will determine broader adoption potential.
- Standard multi-scale GAN training fails to maintain consistent sample identity across stages, allowing intermediate outputs to diverge toward different samples rather than refine previous outputs.
- CAT's consistency regularization on the generator side addresses cross-scale alignment while preserving scale-wise discriminator evaluation, establishing a proper coarse-to-fine hierarchy.
- One-step ImageNet-256 generation with FID-50K of 1.56 is competitive with recent diffusion and flow-based models while requiring only a single sampling step.
- The insight may apply to hierarchical generative approaches beyond GANs, although whether similar misalignment affects those models remains an open question.
- Achieving strong results in 60 training epochs suggests CAT may reduce computational requirements for training high-quality generative models.