🧠 AI | 🟢 Bullish | Importance 7/10

Automated Creativity Evaluation of Language Models Across Open-Ended Tasks

arXiv – CS AI | Min Sen Tan, Zachary Kit Chun Choy, Syed Ali Redha Alsagoff, Nadya Yuki Wangsajaya, Mohor Banerjee, Swaagat Bikash Saikia, Alvin Chan | June 11, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce an automated, domain-agnostic framework for evaluating creativity in large language models across open-ended tasks. The approach uses semantic entropy to measure divergent creativity and a multi-agent judge system for convergent creativity, validated across problem-solving, research ideation, and creative writing domains.

Analysis

This research addresses a critical gap in AI development: the lack of standardized, scalable methods for measuring creativity in language models. Previous creativity metrics were task-specific and embedded domain assumptions, making cross-domain comparison difficult. The new framework decouples the evaluation apparatus from the creative task itself, enabling researchers to assess LLM creativity systematically without redesigning metrics for each application.

The framework takes a dual approach: semantic entropy measures novelty and diversity (divergent creativity), while retrieval-based multi-agent judging measures task fulfillment (convergent creativity). This pairing reflects the view that creativity requires both originality and practical value. Testing across three qualitatively distinct domains (MacGyver problem-solving, HypoGen research ideation, BookMIA creative writing) supports the claim that the method generalizes rather than being tuned to a single task.
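To make the divergent half concrete, here is a minimal sketch of semantic entropy in its general form: sample several responses to one prompt, group the ones that mean the same thing, and take the Shannon entropy of the group sizes. This is not the paper's implementation. The function names, the greedy clustering strategy, and the `toy_same_meaning` check are all assumptions for illustration; a real pipeline would use a semantic-equivalence model where this sketch only compares normalized text.

```python
import math
from typing import Callable, List, Sequence


def cluster_by_meaning(
    responses: Sequence[str],
    same_meaning: Callable[[str, str], bool],
) -> List[List[str]]:
    """Greedy clustering: each response joins the first cluster whose
    representative (first member) it matches, else starts a new cluster."""
    clusters: List[List[str]] = []
    for response in responses:
        for cluster in clusters:
            if same_meaning(cluster[0], response):
                cluster.append(response)
                break
        else:
            clusters.append([response])
    return clusters


def semantic_entropy(
    responses: Sequence[str],
    same_meaning: Callable[[str, str], bool],
) -> float:
    """Shannon entropy (in nats) of the distribution over meaning clusters."""
    if not responses:
        return 0.0
    clusters = cluster_by_meaning(responses, same_meaning)
    total = len(responses)
    probs = [len(cluster) / total for cluster in clusters]
    return sum(p * math.log(1.0 / p) for p in probs)


def toy_same_meaning(a: str, b: str) -> bool:
    """Hypothetical stand-in for a semantic-equivalence model:
    treats two strings as equivalent if they match after lowercasing
    and stripping punctuation."""
    def normalize(s: str) -> List[str]:
        kept = "".join(ch for ch in s.lower() if ch.isalnum() or ch.isspace())
        return kept.split()
    return normalize(a) == normalize(b)


samples = [
    "Use a coin as a screwdriver",
    "use a coin as a screwdriver.",
    "Tape the hinge shut",
    "Melt wax into the gap",
]
print(semantic_entropy(samples, toy_same_meaning))
```

With these four samples the first two collapse into one cluster, so the cluster proportions are 0.5, 0.25, and 0.25 and the entropy is about 1.04 nats. If every sample meant the same thing the score would be 0, and if all four were distinct it would be ln 4, about 1.39. Higher values indicate a more semantically varied set of responses.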

For the AI industry, this work offers a reproducible benchmarking approach that could accelerate development of creative AI systems. As enterprises increasingly seek LLMs for ideation, content generation, and problem-solving tasks, standardized creativity metrics become essential for comparing models and tracking progress. The reported 60% efficiency improvement in convergent creativity assessment also reduces evaluation costs, making systematic creativity testing more accessible to researchers and developers.

The empirical findings show how model size, temperature, recency, and reasoning capability affect creative performance, giving practitioners actionable guidance for tuning LLM deployments. Future research will likely focus on refining these metrics and on testing whether creativity measurements correlate with real-world downstream task performance in commercial applications.
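Of the factors listed, temperature is the one practitioners adjust most directly, so a small illustration of what it does mechanically may help. The sketch below shows standard temperature scaling of a softmax over next-token logits and the resulting token-level entropy. It does not reproduce the paper's experiments or results: the logits are made up, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(logits: Sequence[float], temperature: float) -> List[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max before exponentiating to avoid overflow
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)


# Made-up logits for three candidate tokens (illustrative only).
example_logits = [2.0, 1.0, 0.0]

for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs], round(shannon_entropy(probs), 3))
```

Raising the temperature flattens the sampling distribution, so its entropy increases and less likely tokens are chosen more often. Whether that extra token-level variety shows up as higher semantic diversity or better task fulfillment is the kind of question the framework is meant to measure, not something this sketch answers.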

Key Takeaways

→Researchers developed a domain-agnostic framework for automated LLM creativity evaluation, addressing the limitations of task-specific metrics.
→The framework measures divergent creativity via semantic entropy and convergent creativity through a retrieval-based multi-agent judge system, with a reported 60% efficiency improvement in convergent assessment.
→Validation across problem-solving, research ideation, and creative writing domains supports the framework's generalizability and scalability.
→Empirical analysis shows how model properties such as size, temperature, recency, and reasoning capability affect creative performance.
→Standardized creativity metrics enable reproducible benchmarking and accelerate progress in developing creative AI systems.