y0news
AnalyticsDigestsSourcesTopicsRSSAICrypto

#operational-safety News & Analysis

2 articles tagged with #operational-safety. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

2 articles
AINeutralarXiv โ€“ CS AI ยท Apr 137/10
๐Ÿง 

SAGE: A Service Agent Graph-guided Evaluation Benchmark

Researchers introduce SAGE, a comprehensive benchmark for evaluating Large Language Models in customer service automation that uses dynamic dialogue graphs and adversarial testing to assess both intent classification and action execution. Testing across 27 LLMs reveals a critical 'Execution Gap' where models correctly identify user intents but fail to perform appropriate follow-up actions, plus an 'Empathy Resilience' phenomenon where models maintain polite facades despite underlying logical failures.

AIBearisharXiv โ€“ CS AI ยท Mar 167/10
๐Ÿง 

OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!

Researchers introduced OffTopicEval, a benchmark revealing that all major LLMs suffer from poor operational safety, with even top performers like Qwen-3 and Mistral achieving only 77-80% accuracy in staying on-topic for specific use cases. The study proposes prompt-based steering methods that can improve performance by up to 41%, highlighting critical safety gaps in current AI deployment.

๐Ÿง  Llama