y0news
🧠 AI · Neutral · Importance 6/10

TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue Agent

arXiv – CS AI | Xingyu Sui, Yanyan Zhao, Yulin Hu, Jiahe Guo, Weixiang Zhao, Bing Qin
🤖 AI Summary

Researchers introduce TEA-Bench, the first interactive benchmark for evaluating how external tools improve emotional support conversation (ESC) systems. Testing nine LLMs reveals that tool augmentation reduces hallucination and improves support quality, but effectiveness depends heavily on model capacity—stronger models leverage tools more effectively than weaker ones.

Analysis

TEA-Bench addresses a critical gap in emotional support AI systems by introducing the first systematic evaluation framework that measures how external tools enhance both affective and instrumental support. Traditional ESC systems have focused primarily on emotional expression in isolated text exchanges, missing opportunities for factual grounding through integrated tools. This research demonstrates that connecting LLMs to structured tool environments—following an MCP-style architecture—meaningfully reduces hallucination while improving the reliability of guidance offered to users in emotionally sensitive contexts.
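The summary does not spell out TEA-Bench's actual tool interface, so the following is only a minimal sketch of the MCP-style pattern described above: the model proposes tool calls, a host executes them, and the results are fed back so the final reply is grounded in tool output rather than free generation. The tool name, registry, and stub model here are all illustrative assumptions, not from the paper.

```python
from dataclasses import dataclass

# Hypothetical tool registry standing in for an MCP-style tool server.
TOOLS = {
    "lookup_resource": lambda query: f"Verified info for '{query}'",
}

@dataclass
class ToolCall:
    name: str
    argument: str

def stub_model(history):
    """Stand-in for an LLM: request a tool once, then answer using its result."""
    tool_results = [msg for msg in history if msg[0] == "tool"]
    if not tool_results:
        return ToolCall("lookup_resource", "crisis hotline")
    # Ground the final reply in the tool output instead of free generation.
    return f"I understand this is hard. {tool_results[-1][1]}"

def run_dialogue_turn(model, user_message, max_tool_calls=3):
    """One user turn: alternate model steps and tool executions until a reply."""
    history = [("user", user_message)]
    for _ in range(max_tool_calls):
        step = model(history)
        if isinstance(step, ToolCall):
            result = TOOLS[step.name](step.argument)
            history.append(("tool", result))
        else:
            return step, history
    return model(history), history

reply, trace = run_dialogue_turn(stub_model, "I'm feeling overwhelmed.")
```

The point of the loop structure is that every factual claim in the final reply can be traced back to a tool result in `trace`, which is what makes process-level auditing of grounding possible at all.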

The research builds on growing recognition that LLMs alone struggle with factual accuracy in high-stakes applications. Emotional support conversations are a particularly important case because vulnerable users in crisis moments may act on incorrect information. The benchmark's process-level metrics evaluate support quality alongside factual grounding, creating accountability that text-only metrics miss. The team also released TEA-Dialog, a dataset enabling supervised fine-tuning experiments.

Key findings reveal a capacity hierarchy: GPT-4 and similar frontier models use tools selectively and strategically, while smaller models show marginal improvements despite tool access. This suggests that tool integration alone doesn't guarantee better outputs—model reasoning capacity fundamentally limits tool utilization effectiveness. Supervised fine-tuning improved in-distribution performance but generalized poorly, indicating that tool-enhanced ESC requires robust architectural decisions rather than simple training tricks.

The work signals growing maturity in AI safety research, moving beyond theoretical concerns toward practical benchmarks. Organizations building emotional support AI systems now have concrete evaluation frameworks and baseline data for tool-augmented approaches, likely accelerating adoption of more reliable architectures.

Key Takeaways
  • Tool augmentation reduces hallucination in emotional support conversations but effectiveness correlates with model capacity.
  • Smaller language models show minimal gains from tool access, indicating that architectural capability limits tool utility.
  • TEA-Bench provides the first systematic evaluation framework specifically designed for tool-enhanced emotional support dialogue agents.
  • Supervised fine-tuning on tool-enhanced dialogues improves in-distribution performance but fails to generalize to new scenarios.
  • Process-level metrics that jointly assess emotional quality and factual grounding enable more rigorous evaluation than text-only approaches.
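The summary names process-level metrics that jointly assess emotional quality and factual grounding but gives no formula, so this is only a hedged toy illustration of what such a joint per-turn score could look like. The equal weighting, the 0-1 score scale, and the dict schema are all assumptions for illustration.

```python
def process_level_score(turns, w_support=0.5, w_grounding=0.5):
    """Toy joint metric: weighted per-turn support and grounding, averaged.

    Each turn is a dict with 'support' and 'grounding' scores in [0, 1].
    Weights and scale are illustrative assumptions, not the paper's metric.
    """
    if not turns:
        return 0.0
    per_turn = [
        w_support * turn["support"] + w_grounding * turn["grounding"]
        for turn in turns
    ]
    return sum(per_turn) / len(per_turn)

turns = [
    {"support": 0.9, "grounding": 0.4},  # empathetic but weakly grounded
    {"support": 0.7, "grounding": 1.0},  # tool-backed, fully grounded
]
score = process_level_score(turns)  # → 0.75
```

A text-only metric would reward the first turn's fluent empathy and miss its weak grounding; scoring both dimensions per turn is what surfaces the gap.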
Read Original → via arXiv – CS AI