LLM-Based Multi-Reference Evaluation for Efficient and Robust Assessment of Phrase Break Annotations
Researchers propose LLM-Based Multi-Reference Evaluation (LMRE), a new method for assessing phrase break annotations in speech that acknowledges multiple valid phrasings rather than assuming a single correct interpretation. Tested on 1,356 Korean annotations, LMRE demonstrates stronger alignment with human judgment than traditional single-reference approaches, suggesting large language models can effectively evaluate prosodic speech characteristics at scale.
This research addresses a fundamental limitation in speech evaluation methodology. Traditional annotation assessment relies on single-reference gold standards, which fail to account for the inherent flexibility of prosodic phrasing: where a speaker places phrase breaks affects how natural the speech sounds, yet the same utterance admits multiple valid phrasings. This gap between rigid evaluation and flexible human judgment has been a bottleneck for scaling quality assessment of speech annotations.
The proposed LMRE system uses large language models to generate multiple acceptable phrase break variations from a small number of examples. Rather than comparing an annotation against one correct answer, the system recognizes that several different phrase break patterns for the same sentence (five, say) might all be linguistically valid. This is closer to how human evaluators actually assess speech, and it adds nuance that single-reference metrics lack.
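The contrast between single-reference and multi-reference scoring can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the `/` break marker, the function names, the use of F1 over break positions as the similarity measure, and the hand-written English references standing in for LLM-generated ones.

```python
from typing import List, Set


def break_positions(annotation: str, marker: str = "/") -> Set[int]:
    """Return the number of words preceding each phrase break marker."""
    positions: Set[int] = set()
    words_seen = 0
    for token in annotation.split():
        if token == marker:
            positions.add(words_seen)
        else:
            words_seen += 1
    return positions


def f1(candidate: Set[int], reference: Set[int]) -> float:
    """F1 overlap between two sets of break positions (an assumed metric)."""
    if not candidate and not reference:
        return 1.0  # both annotations contain no breaks, so they agree
    overlap = len(candidate & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def multi_reference_score(candidate: str, references: List[str]) -> float:
    """Score a candidate against its best-matching acceptable reference."""
    cand = break_positions(candidate)
    return max(f1(cand, break_positions(ref)) for ref in references)


# Hypothetical toy data: in LMRE the reference set would come from an LLM.
candidate = "the old man / walked slowly / to the store"
single_reference = ["the old man walked slowly / to the store"]
multiple_references = single_reference + [
    "the old man / walked slowly / to the store",
    "the old man / walked slowly to the store",
]

single_score = multi_reference_score(candidate, single_reference)
multi_score = multi_reference_score(candidate, multiple_references)
print(f"single-reference: {single_score:.2f}, multi-reference: {multi_score:.2f}")
```

In this toy case the candidate is penalized under the single reference for a break that a second acceptable reference also contains. Taking the maximum over references is one simple way to stop a valid alternative phrasing from being scored as an error.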
The results on the Korean test set show closer agreement with human judgment in both acceptance behavior and score consistency. This matters for speech synthesis, prosody modeling, and text-to-speech systems, where naturalness depends on subtle prosodic choices. Companies developing voice AI products face quality assurance challenges when training data requires manual evaluation; LMRE offers a scalable alternative that aligns more closely with human judgment than single-reference evaluation does.
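As a rough illustration of what agreement with human judgment could mean in practice, the sketch below compares an automatic metric's scores against human ratings in two ways: how often its accept/reject decision matches the human one, and how strongly its scores correlate with human scores. The summary does not say which statistics the paper uses, so the match rate, Pearson correlation, the 0.5 acceptance threshold, and all numbers below are assumptions; the values are invented toy data, not results from the study.

```python
from math import sqrt
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def acceptance_agreement(
    metric_scores: Sequence[float],
    human_accept: Sequence[bool],
    threshold: float = 0.5,  # hypothetical cut-off for "accepted"
) -> float:
    """Fraction of items where the metric's accept/reject matches the human's."""
    matches = sum(
        (score >= threshold) == accepted
        for score, accepted in zip(metric_scores, human_accept)
    )
    return matches / len(human_accept)


# Invented toy data for four annotations (not from the paper).
human_scores = [1.0, 0.8, 0.2, 0.9]
human_accept = [True, True, False, True]
single_ref_scores = [1.0, 0.4, 0.2, 0.3]
multi_ref_scores = [1.0, 0.9, 0.3, 0.8]

for name, scores in [("single", single_ref_scores), ("multi", multi_ref_scores)]:
    agreement = acceptance_agreement(scores, human_accept)
    correlation = pearson(scores, human_scores)
    print(f"{name}-reference: agreement={agreement:.2f}, pearson={correlation:.2f}")
```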
The implications extend beyond speech evaluation. The results support using LLMs as evaluators for subjective linguistic tasks where multiple correct answers exist. As AI systems increasingly handle nuanced language tasks, evaluation methods will need to move from binary correctness frameworks toward probabilistic, multi-reference models. Future applications could include dialogue quality assessment, translation evaluation, and other domains where ground truth is inherently ambiguous.
- LLM-based multi-reference evaluation outperforms single-reference methods for assessing phrase break annotations in speech synthesis
- The approach acknowledges that multiple valid prosodic phrasings exist for the same utterance rather than assuming unique correct interpretations
- LMRE demonstrates stronger correlation with human judgment in both acceptance rates and scoring consistency on 1,356 Korean test annotations
- The method scales evaluation without requiring labor-intensive human assessment, addressing a key bottleneck in speech annotation quality control
- Results suggest large language models can effectively evaluate subjective linguistic tasks beyond generation, with applications across speech and language processing