arXiv – CS AI · 18h ago
Can Global XAI Methods Reveal Injected Behaviours in LLMs? SHAP vs Rule Extraction vs RuleSHAP
Researchers propose RuleSHAP, an explainable AI (XAI) method that combines SHAP analysis with rule induction to detect behavioural triggers injected into large language models. The approach outperforms existing techniques by 82% at identifying the belief-driven heuristics that fuel misinformation, offering a practical pathway for auditing LLM safety.