y0news
AnalyticsDigestsSourcesRSSAICrypto
#operational-safety1 article
1 articles
AIBearisharXiv โ€“ CS AI ยท 7h ago7/10
๐Ÿง 

OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always!

Researchers introduced OffTopicEval, a benchmark revealing that all major LLMs suffer from poor operational safety, with even top performers like Qwen-3 and Mistral achieving only 77-80% accuracy in staying on-topic for specific use cases. The study proposes prompt-based steering methods that can improve performance by up to 41%, highlighting critical safety gaps in current AI deployment.

๐Ÿง  Llama