Personalized Turn-Level User Conversation Satisfaction Benchmark
Researchers introduce a personalized turn-level conversation satisfaction benchmark that evaluates AI assistant responses based on individual user expectations and conversation history rather than generic quality metrics. The system combines user memory with context-specific evaluation to produce satisfaction scores and identifies dissatisfying responses more accurately than existing methods.
This research addresses a fundamental limitation in AI evaluation methodology: the assumption that response quality is universal. In practice, user satisfaction depends heavily on personal preferences, prior interactions, and specific expectations within a conversation. The proposed evaluator maintains a compact user memory profile alongside turn-level context, which allows a more nuanced assessment than current LLM-as-a-judge approaches that treat each response in isolation. The work reports that personalized memory combined with post-hoc score calibration significantly improves both ordinal agreement with human annotators and accuracy in detecting dissatisfying responses.
PersTurnBench, the accompanying benchmark, targets a practical evaluation problem: testing a new generation model normally requires either expensive human annotation for each candidate or metrics that may be biased. By fixing the replay state, the benchmark allows different models to be compared under controlled conditions without collecting human feedback again. This is particularly valuable for developers building memory-augmented and personalized AI systems, where generic benchmarks fail to capture improvements in user satisfaction.
The methodology also advances meta-evaluation practice by treating detection of dissatisfied turns separately from detection of satisfied ones, reflecting a real-world priority: missing a negative experience matters more than perfectly calibrating satisfaction scores. The work sits at the intersection of conversational AI, evaluation methodology, and personalization research, with implications for how AI systems are tested before deployment.
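
The summary does not say which calibration method the authors use. As a rough illustration only, the sketch below assumes a simple least-squares linear mapping from raw evaluator scores onto a 1 to 5 rating scale, and treats a turn as dissatisfying when its rating is 2 or lower. The function names, the scale, the threshold, and the toy numbers are all hypothetical and are not taken from the paper. In practice the mapping would be fitted on a held-out set rather than on the turns being evaluated.

```python
from statistics import mean

def fit_linear_calibration(raw, human):
    """Least-squares fit of human ~ a * raw + b (an assumed calibration form)."""
    mx, my = mean(raw), mean(human)
    var = sum((x - mx) ** 2 for x in raw)
    a = sum((x - mx) * (y - my) for x, y in zip(raw, human)) / var
    return a, my - a * mx

def calibrate(raw, a, b, lo=1, hi=5):
    """Map raw scores onto the rating scale, then round and clip to [lo, hi]."""
    return [min(hi, max(lo, round(a * x + b))) for x in raw]

def dissatisfaction_pr(pred, human, threshold=2):
    """Precision and recall for turns rated at or below `threshold`."""
    pairs = list(zip(pred, human))
    tp = sum(p <= threshold and h <= threshold for p, h in pairs)
    fp = sum(p <= threshold and h > threshold for p, h in pairs)
    fn = sum(p > threshold and h <= threshold for p, h in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy data (hypothetical): an evaluator whose raw scores cluster near the top.
raw_scores = [3.5, 4.0, 4.5, 4.8, 3.8, 4.9]
human_ratings = [1, 2, 4, 5, 2, 5]

a, b = fit_linear_calibration(raw_scores, human_ratings)
uncalibrated = [round(x) for x in raw_scores]
calibrated = calibrate(raw_scores, a, b)

print("uncalibrated:", dissatisfaction_pr(uncalibrated, human_ratings))
print("calibrated:  ", dissatisfaction_pr(calibrated, human_ratings))
```

On this toy data the uncalibrated scores never fall to 2 or below, so every dissatisfying turn is missed, while the calibrated scores recover them. That is the kind of failure a dissatisfaction-oriented metric is meant to expose.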
- Personalized user memory significantly improves AI satisfaction evaluation compared to generic response quality metrics.
- PersTurnBench enables controlled model comparison without requiring new human annotations for each candidate system.
- Dissatisfaction-oriented detection outperforms generic satisfaction scoring in identifying problematic responses.
- Post-hoc score calibration improves alignment between automated evaluators and human satisfaction judgments.
- This framework enables fair benchmarking of memory-augmented personalized AI systems against generic models.
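
The fixed-replay-state idea can be pictured as follows. This is a minimal sketch under stated assumptions, not the benchmark's actual interface: `ReplayState`, `compare`, the toy evaluator, and the three candidate generators are hypothetical names invented for illustration. A simple rule stands in for the personalized evaluator, which in the described system would draw on user memory and turn-level context.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class ReplayState:
    """Frozen snapshot of one turn: everything a candidate model gets to see."""
    user_memory: str            # compact profile of the user's preferences
    history: Tuple[str, ...]    # earlier turns in the conversation
    user_message: str           # the message the candidate must answer

Generator = Callable[[ReplayState], str]
Evaluator = Callable[[ReplayState, str], float]

def compare(states: List[ReplayState],
            candidates: Dict[str, Generator],
            evaluator: Evaluator) -> Dict[str, float]:
    """Score every candidate on the same frozen states and average per model."""
    results = {}
    for name, generate in candidates.items():
        scores = [evaluator(s, generate(s)) for s in states]
        results[name] = sum(scores) / len(scores)
    return results

def toy_evaluator(state: ReplayState, response: str) -> float:
    """Stand-in evaluator: rewards a response length that matches the memory."""
    wants_brief = "concise" in state.user_memory
    is_brief = len(response.split()) <= 8
    return 5.0 if wants_brief == is_brief else 2.0

SHORT = "Use a list comprehension."
LONG = ("You could write a loop, but a list comprehension "
        "is usually shorter and easier to read.")

states = [
    ReplayState("prefers concise answers", (), "How do I square a list?"),
    ReplayState("prefers detailed explanations", (), "How do I square a list?"),
]
candidates = {
    "terse": lambda s: SHORT,
    "verbose": lambda s: LONG,
    "memory_aware": lambda s: SHORT if "concise" in s.user_memory else LONG,
}

results = compare(states, candidates, toy_evaluator)
print(results)
```

Because every candidate answers the same frozen states, differences in the averaged scores come from the models and not from differences in conversation context, and no new human ratings are needed when a candidate is added.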