AI · Neutral · arXiv CS AI · 10h ago · 7/10
SAGE: A Service Agent Graph-guided Evaluation Benchmark
Researchers introduce SAGE, a benchmark for evaluating Large Language Models in customer service automation. It uses dynamic dialogue graphs and adversarial testing to assess both intent classification and action execution. Tests across 27 LLMs reveal a critical 'Execution Gap', in which models correctly identify user intents but fail to perform the appropriate follow-up actions, and an 'Empathy Resilience' phenomenon, in which models maintain a polite facade despite underlying logical failures.