
Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations

arXiv – CS AI | Mingqian Zheng, Malia Morgan, Liwei Jiang, Carolyn Rose, Maarten Sap

🤖 AI Summary

Researchers introduce CarryOnBench, a new interactive benchmark that evaluates whether large language models can recover helpfulness, while maintaining safety, when users clarify benign intent across multi-turn conversations. An evaluation of 14 models across nearly 24,000 responses shows that models withhold information primarily because they misinterpret user intent, not because they lack knowledge, and identifies three failure modes that single-turn safety evaluations miss: utility lock-in, unsafe recovery, and repetitive recovery.

Analysis

The research addresses a critical blind spot in current LLM safety evaluation methodologies. While existing benchmarks focus on whether models resist adversarial attacks, they fail to measure whether models can appropriately adjust their responses when legitimate users provide clarifying context. This distinction matters because overly conservative safety measures can render models unhelpful to benign users, creating a user experience problem that extends beyond pure safety concerns.

The gap between initial performance (10.5-37.6% utility fulfillment) and performance when intent is stated upfront (25.1-72.1%) demonstrates that models systematically misinterpret queries rather than lack information. This reveals a calibration problem in safety alignment—models err toward rejection even when clarification should resolve concerns. The introduction of three previously undetected failure modes shows that safety gains achieved in single-turn interactions may disguise problematic behavior in realistic multi-turn scenarios where users naturally refine their requests.
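
The calibration argument above rests on a simple difference: how much fulfillment is lost when intent is left implicit versus stated upfront. A minimal sketch of that arithmetic, using only the range endpoints quoted in this summary (not per-model data from the paper):

```python
# Hypothetical illustration of the "utility gap" described above: the
# difference between fulfillment when benign intent is stated upfront and
# fulfillment in the initial, unclarified response. The numbers are the
# range endpoints quoted in this summary, not per-model results.

def utility_gap(initial: float, upfront: float) -> float:
    """Fraction of benign information needs lost to intent misinterpretation."""
    return upfront - initial

# Range endpoints from the summary, as fractions of needs fulfilled.
low_gap = utility_gap(0.105, 0.251)   # most conservative endpoint
high_gap = utility_gap(0.376, 0.721)  # most capable endpoint

print(f"gap range: {low_gap:.3f} to {high_gap:.3f}")  # → gap range: 0.146 to 0.345
```

Because the gap is positive at both endpoints, the shortfall cannot be explained by missing knowledge alone: the same models fulfill substantially more needs once intent is explicit.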

For AI developers and deployment teams, these findings suggest that current safety evaluations provide incomplete signals about production readiness. A model appearing acceptably safe in benchmarks might frustrate users through unresponsiveness, or conversely, recover utility at unacceptable safety costs. The convergence of conversation harmfulness levels regardless of initial conservatism raises questions about whether different safety approaches produce meaningfully different outcomes over realistic interaction lengths.

Future work should focus on developing safety approaches that maintain robustness while preserving responsiveness to genuine clarification. This research establishes that the industry needs multi-turn evaluation frameworks as standard practice, not optional assessment tools.

Key Takeaways
  • Models fulfill only 10.5-37.6% of benign information needs in initial responses, but 25.1-72.1% when intent is stated upfront, indicating systematic misinterpretation rather than knowledge gaps.
  • Three previously invisible failure modes emerge in multi-turn conversations: utility lock-in (ignoring clarification), unsafe recovery (sacrificing safety for utility), and repetitive recovery (recycling old responses).
  • 13 of 14 models can recover utility when users provide benign clarifications, suggesting multi-turn interactions offer redemption paths that single-turn evaluations cannot detect.
  • Conversations converge to similar harmfulness levels regardless of model conservatism at start, implying that initial safety tuning may not determine long-term safety outcomes.
  • Single-turn safety benchmarks fail to distinguish between appropriately cautious models and unresponsive ones, masking real deployment problems.
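
The three failure modes listed above can be thought of as a classification over post-clarification conversation turns. A minimal sketch of such a classifier, with illustrative labels and per-turn judgments that are assumptions for this example, not the paper's implementation:

```python
# Hypothetical sketch of detecting the three multi-turn failure modes named
# in the takeaways from per-turn (utility, safety) judgments. The Turn
# fields and decision rules are illustrative assumptions, not CarryOnBench's
# actual scoring pipeline.

from dataclasses import dataclass

@dataclass
class Turn:
    helpful: bool   # did the response address the clarified benign need?
    safe: bool      # did the response stay within safety policy?
    repeated: bool  # is it substantially a restatement of an earlier reply?

def failure_mode(turns: list[Turn]) -> str:
    """Classify a post-clarification trajectory into one outcome label."""
    if any(t.helpful and not t.safe for t in turns):
        return "unsafe recovery"          # utility regained at a safety cost
    if all(not t.helpful for t in turns):
        if all(t.repeated for t in turns):
            return "repetitive recovery"  # recycles earlier responses
        return "utility lock-in"          # ignores the clarification entirely
    return "recovered"                    # safe and helpful after clarifying

# A model that keeps restating its earlier refusal after clarification:
print(failure_mode([Turn(False, True, True), Turn(False, True, True)]))
# → repetitive recovery
```

The key design point the paper's taxonomy captures is that "recovered" is only one of four outcomes: a model can fail by standing still (lock-in), by moving too far (unsafe recovery), or by merely appearing to move (repetition).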