arXiv — CS AI · 7h ago
Blending Human and LLM Expertise to Detect Hallucinations and Omissions in Mental Health Chatbot Responses
Researchers demonstrate that standard LLM-as-a-judge methods achieve only 52% accuracy in detecting hallucinations and omissions in mental health chatbot responses, barely above chance and inadequate for high-risk healthcare contexts. A hybrid framework combining human domain expertise with machine-learning features performs significantly better (F1 scores of 0.717-0.849), suggesting that transparent, interpretable approaches outperform black-box LLM evaluation in safety-critical applications.
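The paper's framework is not specified here, but the "interpretable features plus a simple model" idea it describes can be sketched as follows. This is a minimal illustration, not the authors' method: the feature names, weights, and toy data are all hypothetical, standing in for expert-designed signals (e.g. how much of the grounding material a response covers, or how many unmentioned entities it introduces), with detection quality scored by F1 as in the summary above.

```python
# Hypothetical sketch: a transparent linear scorer over expert-designed
# features, evaluated with F1. All feature names, weights, and data are
# illustrative assumptions, not taken from the paper.

WEIGHTS = {"fact_coverage": -2.0, "novel_entities": 1.5, "hedging_terms": -0.5}
BIAS = 0.5

def flag(features):
    """Linear score over interpretable features; positive => flag the
    response as containing a hallucination or omission."""
    score = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return score > 0

def f1(preds, golds):
    """Standard F1 over boolean predictions and gold labels."""
    tp = sum(1 for p, g in zip(preds, golds) if p and g)
    fp = sum(1 for p, g in zip(preds, golds) if p and not g)
    fn = sum(1 for p, g in zip(preds, golds) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy labeled examples: (feature dict, gold label: True = faulty response).
data = [
    ({"fact_coverage": 0.9, "novel_entities": 0.0, "hedging_terms": 0.2}, False),
    ({"fact_coverage": 0.3, "novel_entities": 0.8, "hedging_terms": 0.0}, True),
    ({"fact_coverage": 0.5, "novel_entities": 0.6, "hedging_terms": 0.1}, True),
    ({"fact_coverage": 1.0, "novel_entities": 0.1, "hedging_terms": 0.4}, False),
]
preds = [flag(features) for features, _ in data]
print(round(f1(preds, [gold for _, gold in data]), 3))  # → 1.0 on this toy set
```

The point of the design is the one the summary credits to the paper: every decision decomposes into named, inspectable feature contributions, unlike a black-box LLM judgment.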