Researchers introduce NILC, a novel clustering framework that combines large language models with iterative refinement to improve new intent discovery in dialogue systems. Unlike traditional cascaded approaches relying solely on embedding-based K-Means clustering, NILC leverages LLMs to enhance cluster semantics and augment ambiguous utterances, demonstrating consistent performance gains across multiple benchmark datasets.
NILC addresses a fundamental limitation in dialogue system design: the challenge of recognizing both known and novel user intents from unlabeled data. Traditional new intent discovery pipelines treat embedding generation and clustering as independent stages, missing opportunities for mutual refinement. This research demonstrates how LLM integration can bridge that gap through semantic enrichment and iterative feedback loops.
The technical contribution centers on three key mechanisms. First, NILC generates supplementary semantic centroids alongside standard Euclidean centroids, capturing nuanced contextual meanings that pure vector embeddings miss. Second, the framework identifies hard samples—ambiguous or sparse utterances—and uses LLMs to rewrite them for improved cluster alignment. Third, it applies semi-supervised learning through seeding and soft constraints, injecting human knowledge efficiently into the unsupervised clustering process.
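One plausible way to make the "hard sample" idea concrete is a distance-margin test: an utterance counts as ambiguous when its nearest and second-nearest centroids are almost equally far away. The sketch below illustrates that test only. It is not NILC's actual selection criterion, which this summary does not specify; the function name `flag_hard_samples`, the `margin` threshold, and the toy 2-D vectors are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flag_hard_samples(embeddings, centroids, margin=0.2):
    """Return indices of embeddings that sit near a cluster boundary.

    Assumption (not from the source): a sample is "hard" when the gap
    between its two closest centroids is smaller than `margin`.
    Requires at least two centroids.
    """
    hard = []
    for i, vec in enumerate(embeddings):
        dists = sorted(euclidean(vec, c) for c in centroids)
        if dists[1] - dists[0] < margin:
            hard.append(i)
    return hard


# Toy data: the last point lies exactly between the two centroids.
embeddings = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (0.5, 0.0)]
centroids = [(0.0, 0.0), (1.0, 0.0)]

hard = flag_hard_samples(embeddings, centroids)
print(hard)  # [3]
```

In a pipeline like the one described, the flagged indices would be the utterances handed to an LLM for rewriting before the next clustering pass; that LLM call is deliberately left out here because its prompt and interface are not described in this summary.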
For dialogue system developers, the practical value lies in improving intent classification accuracy without requiring extensive labeled datasets. Consistent improvements across benchmarks from diverse domains suggest the approach could carry over to customer service, virtual assistant, and chatbot applications. Fewer misclassified intents should in turn mean fewer conversational breakdowns and a better user experience.
The research supports a broader trend: combining LLMs with classical machine learning techniques can yield better results than either approach alone. As dialogue systems handle more complex, domain-specific intents, frameworks that leverage both embedding geometry and semantic understanding become more valuable. Future work may extend this iterative refinement pattern to other NLP tasks that face similar limitations of cascaded architectures.
- →NILC combines LLM-generated semantic centroids with embedding-based clustering for superior intent discovery performance
- →The framework iteratively refines uncertain utterances through LLM-assisted rewriting rather than treating clustering as a single-pass operation
- →Semi-supervised techniques using seeding and soft constraints improve accuracy with minimal labeled data
- →NILC consistently outperforms baseline methods across six diverse benchmark datasets
- →The approach addresses the fundamental limitation of embedding-only clustering by using LLMs to capture contextual nuances that embeddings alone miss
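
The seeding idea from the takeaways above can be illustrated with a small seeded K-Means sketch: a handful of utterances with known intents set the initial centroids, and ordinary K-Means updates take over from there. This is a generic textbook-style variant written in plain Python, not NILC's implementation. It covers seeding only and omits the soft constraints, and the intent names, the `seeded_kmeans` function, and the 2-D points are hypothetical.

```python
def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seeded_kmeans(points, seeds, n_iter=10):
    """K-Means whose initial centroids come from labeled seed points.

    points: list of embedding tuples.
    seeds:  dict mapping a cluster name to indices of points known to
            belong to it (hypothetical labels, for illustration).
    Seeds only initialize the centroids; they may be reassigned later.
    """
    cluster_ids = sorted(seeds)
    centroids = {c: mean([points[i] for i in seeds[c]]) for c in cluster_ids}
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(cluster_ids, key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: recompute each centroid from its current members.
        for c in cluster_ids:
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (0.1, 0.3), (4.8, 5.2)]
seeds = {"check_balance": [0], "book_flight": [2]}  # one labeled example each

labels, centroids = seeded_kmeans(points, seeds)
print(labels)
```

With only one labeled example per intent, the two groups of points separate cleanly in this toy setup, which is the sense in which seeding injects a small amount of human knowledge into an otherwise unsupervised procedure.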