Researchers introduce a framework for evaluating whether AI creative systems cause population-level diversity collapse: individual output quality improves while collective idea similarity increases. In tests of three frontier LLMs across creative tasks, all fall below diversity parity with human baselines, and the study proposes design interventions to mitigate crowding effects at development time.
This research addresses a critical blind spot in AI evaluation: individual utility versus population-level value. Creative systems optimized for single-output quality can inadvertently homogenize the creative landscape when deployed at scale, reducing the collective value of generated ideas. The paper's innovation lies in proposing an ex ante evaluation protocol that benchmarks diversity collapse without requiring human-AI interaction data, instead using model-generated outputs compared against human baselines.
The framework models ideas as congestible resources and introduces measurable metrics—an excess-crowding coefficient and a human-relative diversity ratio—that quantify when AI-generated outputs exhibit problematic crowding. Testing across short stories, marketing slogans, and alternative-uses tasks reveals that current frontier LLMs consistently underperform humans on diversity metrics, suggesting widespread crowding risk across creative domains.
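The paper's exact metric definitions aren't reproduced in this summary, but the human-relative diversity ratio can be sketched as the mean pairwise distance among AI outputs divided by the same quantity for a human baseline. The sketch below uses bag-of-words cosine distance as a stand-in for whatever text representation the authors actually use; all function names and the representation choice are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine_distance(a: Counter, b: Counter) -> float:
    # 1 - cosine similarity over token-count vectors
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return 1.0 - (dot / norm if norm else 0.0)

def mean_pairwise_distance(texts: list[str]) -> float:
    # Average dissimilarity across all unordered pairs in a pool of ideas
    vecs = [Counter(t.lower().split()) for t in texts]
    pairs = list(combinations(vecs, 2))
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)

def human_relative_diversity_ratio(ai_outputs: list[str], human_outputs: list[str]) -> float:
    # A ratio below 1.0 suggests the AI pool is more homogeneous
    # than the human pool on the same task
    return mean_pairwise_distance(ai_outputs) / mean_pairwise_distance(human_outputs)
```

Under this toy definition, a pool of near-identical AI slogans scored against a varied human pool yields a ratio well below 1.0, which is the kind of signal the framework treats as crowding risk.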
For developers and organizations deploying creative AI, this work shifts diversity from a theoretical concern to an actionable optimization target. The finding that crowding can be reduced through generation-protocol variants indicates that targeted design choices at development time can preserve diversity without sacrificing individual output quality. This is particularly relevant for industries relying on creative AI—marketing, content creation, design—where audience fatigue from homogeneous outputs diminishes competitive advantage.
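The summary does not detail the generation-protocol variants the study tested. One common family of such interventions is rejection-based diversification: over-generate candidates, then greedily accept only those sufficiently dissimilar to ideas already kept. The sketch below is a minimal illustration of that idea, not the paper's method; the Jaccard measure and the `max_sim` threshold are assumptions, and `candidates` stands in for a stream of model outputs.

```python
def jaccard(a: str, b: str) -> float:
    # Token-overlap similarity between two idea strings
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def diverse_generate(candidates: list[str], k: int, max_sim: float = 0.5) -> list[str]:
    # Greedy rejection protocol: accept a candidate only if it is
    # sufficiently dissimilar to every idea already accepted.
    accepted: list[str] = []
    for c in candidates:
        if all(jaccard(c, prev) <= max_sim for prev in accepted):
            accepted.append(c)
        if len(accepted) == k:
            break
    return accepted
```

Because filtering happens after generation, a protocol like this can raise pool-level diversity without touching the model or the per-output quality criterion, which matches the summary's claim that crowding is addressable at development time.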
The broader implication challenges the prevailing single-output optimization paradigm in AI development. As creative systems become more prevalent, population-aware evaluation frameworks may become standard practice, potentially reshaping how companies benchmark model performance. The research establishes a methodological foundation for addressing diversity collapse as systems scale.
- Current frontier LLMs fail diversity parity tests compared to humans across multiple creative tasks, creating population-level crowding risks.
- The proposed evaluation framework measures crowding ex ante using only model outputs and human baselines, without requiring interaction data.
- Generation protocol design choices can meaningfully reduce diversity collapse, making it an addressable engineering problem rather than an inevitable tradeoff.
- Diversity collapse represents an economic problem where homogeneous AI outputs reduce collective value in crowded idea markets.
- The framework establishes diversity as a measurable development-time target alongside traditional output quality metrics.