Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation
Researchers present DAVE, a training-free method that increases diversity in text-to-image generation by attenuating the DC (zero-frequency) component of intermediate Transformer features during the early stages of generation. The technique addresses the problem of near-identical outputs from the same prompt without expensive sampling overhead or auxiliary optimization.
Text-to-image models can now generate high-quality images from natural language prompts, yet they suffer from a critical limitation: semantic lock-in, in which repeated runs of the same prompt produce nearly identical images. This research traces that homogeneity to a specific mechanism by analyzing intermediate Transformer representations: the spatially averaged (DC) component of the features converges too rapidly across different random seeds early in the generation process. This premature convergence constrains downstream variation and limits the model's capacity to explore diverse visual interpretations of the same text.
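One way to check for this kind of convergence is to compare the spatially averaged feature vector across seeds at a given generation step. The following is a minimal sketch, not the authors' analysis code. It assumes features arrive as arrays of shape (num_tokens, channels), uses cosine similarity as an assumed agreement measure, and substitutes synthetic data for real model activations. The function names are hypothetical.

```python
import numpy as np

def dc_component(tokens):
    """Return the spatial mean (DC component) of a (num_tokens, channels) array."""
    return np.asarray(tokens, dtype=float).mean(axis=0)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all distinct pairs of row vectors."""
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sim = v @ v.T
    upper = np.triu_indices(len(v), k=1)  # count each unordered pair once
    return float(sim[upper].mean())

if __name__ == "__main__":
    # Synthetic stand-in for features from 8 seeds (assumption, not real
    # activations): a shared offset plus seed-specific noise.
    rng = np.random.default_rng(0)
    shared = rng.normal(size=64)
    feats = [shared + rng.normal(size=(256, 64)) for _ in range(8)]
    dcs = [dc_component(f) for f in feats]
    print("cross-seed DC similarity:", mean_pairwise_cosine(dcs))
```

A value close to 1 at an early step would indicate that the seeds already agree on the DC component, which is the lock-in pattern described above.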
The proposed DAVE method operates at the representation level rather than through external mechanisms. By selectively attenuating the DC component during early generation phases, the approach leaves existing sampling pipelines intact and adds negligible computational cost. This training-free intervention contrasts sharply with previous diversity-enhancement techniques, which impose substantial computational burdens through expensive sampling strategies or auxiliary optimization loops.
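The core operation can be sketched in a few lines. This is an illustration under stated assumptions, not the published implementation: the DC component is taken to be the mean over the token axis, and the attenuation factor `alpha`, the `early_fraction` cutoff, and both function names are hypothetical choices that the summary above does not specify.

```python
import numpy as np

def attenuate_dc(tokens, alpha):
    """Scale the DC (token-mean) component of features by alpha.

    tokens: array of shape (num_tokens, channels).
    alpha:  1.0 leaves the features unchanged; 0.0 removes the DC component.
    """
    tokens = np.asarray(tokens, dtype=float)
    dc = tokens.mean(axis=0, keepdims=True)
    # Keep the zero-mean residual untouched and add back a scaled mean.
    return (tokens - dc) + alpha * dc

def alpha_for_step(step, num_steps, early_fraction=0.3, alpha_early=0.5):
    """Hypothetical schedule: attenuate only during the early steps."""
    return alpha_early if step < early_fraction * num_steps else 1.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(loc=2.0, size=(256, 64))
    out = attenuate_dc(feats, alpha_for_step(step=0, num_steps=50))
    # The mean is scaled while the per-token deviations are preserved.
    print(np.allclose(out.mean(axis=0), 0.5 * feats.mean(axis=0)))
```

Because the operation is a mean, a subtraction, and a scaled addition per modified layer, its cost is small relative to a Transformer forward pass, which is consistent with the negligible overhead described above.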
For the AI development community, this finding has significant implications for generative model architecture and optimization. The research demonstrates that understanding failure modes at the representation level can yield lightweight, efficient solutions that don't compromise quality. Developers deploying text-to-image systems gain a practical tool for production environments where both diversity and computational efficiency matter. The technique's compatibility with existing models suggests broad applicability across different Transformer-based architectures. Looking ahead, similar representation-level analysis could reveal other bottlenecks in generative models, potentially spawning a new class of surgical interventions that improve model behavior without extensive retraining.
- DAVE identifies DC component convergence as the root cause of text-to-image model homogeneity
- The training-free method requires negligible computational overhead while maintaining image quality
- Early trajectory lock-in in Transformer features fundamentally limits downstream diversity
- Representation-level interventions offer efficient alternatives to expensive sampling-based diversity methods
- The technique is compatible with existing production pipelines without architectural modifications