"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise
Researchers propose a novel metric called 'Decan' for measuring diversity in AI-generated creative outputs using in-context learning and language model probabilities, achieving 84.6% accuracy on benchmark tests. The approach detects mode collapse and diversity loss across training stages without requiring specialized embedding models or human annotation, offering a practical tool for evaluating generative AI systems.
This research addresses a fundamental challenge in generative AI evaluation: quantifying diversity in creative outputs in a scalable, reproducible manner. Traditional diversity metrics often require expensive human annotation, specialized embedding models, or reference corpora, creating bottlenecks in model development and comparison. The Decan metric sidesteps these limitations by leveraging a language model's native probability distributions, enabling single-pass computation that scales efficiently across multiple samples and prompts.
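To make the idea in the title concrete, here is a minimal sketch of what "progressive conditional surprise" could look like. It is an illustration under stated assumptions, not the authors' implementation: the summary above does not give the exact formula, so the function names, the mean aggregation, and the `vocab_size` parameter are all hypothetical. A toy add-one-smoothed unigram model stands in for a real language model, whose in-context token probabilities would play this role in practice.

```python
import math
from collections import Counter


def conditional_surprise(sample, context_samples, vocab_size=1000):
    """Mean per-token surprise (in nats) of `sample` given earlier samples.

    ASSUMPTION: a toy add-one-smoothed unigram model fit on the context
    stands in for a real language model's in-context probabilities.
    """
    counts = Counter(tok for s in context_samples for tok in s.split())
    total = sum(counts.values())
    tokens = sample.split()
    if not tokens:
        return 0.0
    nll = 0.0
    for tok in tokens:
        # Tokens already seen in the context get higher probability,
        # so repeated content becomes less surprising.
        p = (counts[tok] + 1) / (total + vocab_size)
        nll -= math.log(p)
    return nll / len(tokens)


def progressive_surprise(samples, vocab_size=1000):
    """Surprise of each sample conditioned on all samples before it."""
    return [
        conditional_surprise(s, samples[:i], vocab_size)
        for i, s in enumerate(samples)
    ]


def diversity_score(samples, vocab_size=1000):
    """Hypothetical aggregate: mean of the progressive surprise curve."""
    curve = progressive_surprise(samples, vocab_size)
    return sum(curve) / len(curve)


if __name__ == "__main__":
    collapsed = ["the cat sat"] * 4
    varied = ["the cat sat", "a dog ran", "birds fly south", "rain fell hard"]
    print("collapsed:", round(diversity_score(collapsed), 3))
    print("varied:   ", round(diversity_score(varied), 3))
```

In this sketch a collapsed set of samples yields a surprise curve that falls quickly, because each new sample is predictable from the ones before it, while a varied set stays surprising. With a real language model, the earlier samples would be placed in the prompt and the per-token log-probabilities of each later sample read off in a single forward pass.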
The work emerges from growing concerns about mode collapse in post-trained AI systems. As models undergo supervised fine-tuning (SFT), direct preference optimization (DPO), and reinforcement learning stages, they often converge toward repetitive, homogeneous outputs, which is a particular problem for creative applications like writing and design. The authors document a monotonic decline in diversity across OLMo-2-7B's training pipeline, which demonstrates the metric's sensitivity to genuine distributional changes and suggests it captures a loss of variety that users would likely perceive.
For developers and researchers, this metric provides immediate practical value: a computationally efficient way to monitor model behavior during training without external tooling. The 84.6% accuracy on human-grounded benchmarks indicates reasonable agreement with human judgment, though the 5.1-point gap to the 89.7% SentBERT baseline suggests room for refinement. The approach's information-theoretic foundation makes it interpretable and, in principle, applicable across different model architectures.
Looking forward, adoption hinges on whether the metric generalizes beyond the tested domains and whether practitioners find the efficiency-accuracy tradeoff acceptable compared to embedding-based baselines such as SentBERT. The work signals a broader industry movement toward building evaluation infrastructure that matches the complexity of modern training pipelines.
- The Decan metric enables single-pass diversity measurement using language model probabilities, without external models or human labels
- The approach detects meaningful diversity loss across post-training stages, indicating sensitivity to genuine changes in model behavior
- It achieves 84.6% accuracy on human-grounded benchmarks, approaching but trailing the 89.7% SentBERT baseline
- Computational efficiency and scalability make the metric practical for routine model evaluation during development
- Its information-theoretic grounding suggests it may generalize across architectures and creative domains, though this remains to be tested beyond the evaluated settings