NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs
Researchers have developed NICE, a theory-grounded diagnostic benchmark for evaluating the social intelligence of large language models, organizing social abilities into 4 categories and 11 dimensions. Testing across 5 frontier LLMs reveals that while models perform well in aggregate accuracy, they consistently struggle with communication tasks, particularly in multi-turn dialogue, nonverbal understanding, and synchrony.
The emergence of LLMs in social applications—from emotional support chatbots to customer service representatives—has created an urgent need for rigorous evaluation frameworks beyond generic performance metrics. NICE addresses this gap by introducing the first comprehensive, theory-grounded diagnostic tool that goes beyond simple accuracy scoring to identify specific capability weaknesses that matter in real-world social contexts.
This research builds on longstanding psychometric principles and social theory, and grounds its design in expert validation and a literature review. The framework's structure, which spans Norm, Interaction, Cognition, and Experience across 11 dimensions, provides a systematic way to map social intelligence rather than treating it as a monolithic construct. By identifying communication as a consistent weakness across models and pinpointing the facets responsible (multi-turn exchanges, nonverbal cues, temporal synchrony), the benchmark enables targeted improvements rather than generic optimization.
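To illustrate why a per-dimension breakdown can surface what an aggregate score hides, here is a minimal Python sketch. It is not NICE's actual scoring code or data: the function names, the group labels, and the toy results are all hypothetical and chosen only to show the arithmetic.

```python
from collections import defaultdict


def aggregate_accuracy(results):
    """Overall fraction correct across all (group, is_correct) pairs."""
    results = list(results)
    return sum(int(ok) for _, ok in results) / len(results)


def accuracy_by_group(results):
    """Fraction correct per group label."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for group, ok in results:
        totals[group] += 1
        correct[group] += int(ok)
    return {group: correct[group] / totals[group] for group in totals}


# Hypothetical toy results (NOT from the paper): three groups at 9/10 correct
# and one "communication" group at 4/10 correct.
toy_results = (
    [("group_a", True)] * 9 + [("group_a", False)] * 1
    + [("group_b", True)] * 9 + [("group_b", False)] * 1
    + [("group_c", True)] * 9 + [("group_c", False)] * 1
    + [("communication", True)] * 4 + [("communication", False)] * 6
)

overall = aggregate_accuracy(toy_results)   # 31 correct out of 40 items
per_group = accuracy_by_group(toy_results)
weakest = min(per_group, key=per_group.get)

print(f"aggregate: {overall:.3f}")
for group, acc in sorted(per_group.items()):
    print(f"{group}: {acc:.2f}")
print(f"weakest group: {weakest}")
```

In this toy setup the aggregate looks respectable while one group sits far below the rest, which is the kind of gap a fine-grained diagnostic is meant to expose.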
For developers and AI companies deploying LLMs in social contexts, NICE provides actionable diagnostic data that reveals gaps existing benchmarks miss. The finding that models fail at communication despite strong aggregate performance suggests current training approaches may not adequately capture the nuanced, contextual nature of human interaction. This has direct implications for safety and quality in AI-mediated social services, where communication failures could harm user experiences or relationships.
Looking forward, NICE establishes a precedent for theory-grounded evaluation across other domains. As LLMs continue expanding into sensitive applications, similar diagnostic frameworks will likely become standard requirements for responsible deployment, potentially influencing product development priorities and regulatory expectations for social AI systems.
- NICE introduces the first holistic, theory-grounded diagnostic framework for measuring LLM social intelligence across 4 categories and 11 dimensions
- Testing reveals LLMs achieve high aggregate accuracy but consistently fail at communication tasks including multi-turn dialogue and nonverbal understanding
- The framework enables fine-grained diagnosis of socially consequential weaknesses rather than generic performance scoring
- Findings suggest current LLM training approaches inadequately capture contextual nuances required for authentic human interaction
- Diagnostic benchmarks like NICE may become standard requirements for deploying LLMs in social and customer-facing applications