
Can Global XAI Methods Reveal Injected Behaviours in LLMs? SHAP vs Rule Extraction vs RuleSHAP

arXiv – CS AI | Francesco Sovrano
🤖 AI Summary

Researchers propose RuleSHAP, an explainable AI (XAI) method that combines SHAP analysis with rule induction to detect behavioral triggers injected into large language models. It improves on RuleFit by 82% in Mean Reciprocal Rank when identifying belief-driven heuristics that fuel misinformation, offering a practical route to auditing LLM safety.

Analysis

This research addresses a critical gap in AI safety: traditional explainable AI methods struggle to extract explicit behavioral rules from black-box language models. The study injects known behavioral triggers into GPT and Llama models, ranging from simple univariate patterns to complex non-convex triggers, and then tests whether existing XAI methods can recover them as interpretable rules. The core difficulty is that XAI tools were designed for numerical data, not for the complex semantic space in which language models operate.
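To make the difference between a univariate trigger and a non-convex trigger concrete, here is a minimal Python sketch. The score scale, the thresholds, and all function names are hypothetical and are not taken from the paper; the point is only that a trigger firing on two disjoint regions of a score cannot be reproduced by any single threshold rule.

```python
from typing import List, Sequence


def univariate_trigger(score: float, threshold: float = 0.7) -> bool:
    """Hypothetical simple trigger: fires when one score exceeds a threshold."""
    return score > threshold


def non_convex_trigger(score: float, low: float = 0.2, high: float = 0.8) -> bool:
    """Hypothetical non-convex trigger: fires on two disjoint regions of the score."""
    return score < low or score > high


def single_threshold_can_fit(samples: Sequence[float], labels: Sequence[bool]) -> bool:
    """Check whether any rule 'score > t' or 'score < t' reproduces the labels."""
    points: List[float] = sorted(set(samples))
    # Candidate thresholds: one below all points, the midpoints, one above all points.
    thresholds = [points[0] - 1.0]
    thresholds += [(a + b) / 2 for a, b in zip(points, points[1:])]
    thresholds.append(points[-1] + 1.0)
    for t in thresholds:
        if all((s > t) == y for s, y in zip(samples, labels)):
            return True
        if all((s < t) == y for s, y in zip(samples, labels)):
            return True
    return False


samples = [0.1, 0.5, 0.9]
simple_labels = [univariate_trigger(s) for s in samples]
complex_labels = [non_convex_trigger(s) for s in samples]

print(single_threshold_can_fit(samples, simple_labels))
print(single_threshold_can_fit(samples, complex_labels))
```

The first check succeeds because one cut point separates the firing region from the rest. The second fails because the trigger fires at both ends of the scale, so recovering it requires a rule that combines at least two conditions.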

The research emerges amid growing concern that LLMs amplify misinformation through belief-driven heuristics embedded during training. By examining three documented misinformation drivers (valence framing, information overload, oversimplification), the team provides empirical grounding for understanding how models internalize and apply problematic defaults. Their statistically validated abstraction layer maps LLM beliefs to numerical scores, so that off-the-shelf XAI techniques can be applied to them.

RuleSHAP's 82% improvement over RuleFit in Mean Reciprocal Rank (MRR) suggests meaningful progress in behavioral auditing. SHAP alone ranked features well but did not produce symbolic rules; RuleFit extracted rules but missed non-univariate patterns. RuleSHAP bridges this gap by coupling SHAP's feature importance rankings with rule induction algorithms. For AI safety practitioners and model developers, this is a more systematic way to identify dangerous behavioral patterns before deployment. The methodology extends the auditing toolkit beyond simple statistical checks toward capturing complex conditional behaviors that might otherwise remain hidden in model weights.
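For readers unfamiliar with the metric, Mean Reciprocal Rank averages 1/rank of the correct item across several ranked lists, so a method that places the injected rule first every time scores 1.0. The sketch below is a generic implementation, not the paper's evaluation code; the rule names and the numbers passed to `relative_improvement` are made up for illustration, and the summary does not state whether the 82% figure is a relative or an absolute gain.

```python
from typing import Hashable, Sequence


def reciprocal_rank(ranked: Sequence[Hashable], target: Hashable) -> float:
    """Return 1/rank of target in ranked (1-indexed), or 0.0 if it is absent."""
    for position, item in enumerate(ranked, start=1):
        if item == target:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]], targets: Sequence[Hashable]
) -> float:
    """Average the reciprocal rank of each target over its corresponding ranking."""
    if len(rankings) != len(targets):
        raise ValueError("rankings and targets must have the same length")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank(r, t) for r, t in zip(rankings, targets))
    return total / len(rankings)


def relative_improvement(baseline: float, candidate: float) -> float:
    """Relative gain of candidate over baseline, e.g. 0.82 means 82% higher."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (candidate - baseline) / baseline


# Hypothetical rankings of extracted rules; "injected" marks the ground-truth rule.
rankings = [
    ["injected", "rule_b", "rule_c"],  # found at rank 1
    ["rule_b", "injected", "rule_c"],  # found at rank 2
    ["rule_c", "rule_b", "injected"],  # found at rank 3
]
mrr = mean_reciprocal_rank(rankings, ["injected"] * 3)
print(round(mrr, 4))

# Made-up scores, chosen only to show what an 82% relative gain looks like.
print(round(relative_improvement(0.50, 0.91), 2))
```

Because the metric rewards top placements disproportionately, moving the injected rule from rank 3 to rank 1 raises its contribution from about 0.33 to 1.0, which is why MRR suits the question of whether an auditor would see the rule near the top of an extracted list.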

Key Takeaways
  • RuleSHAP combines SHAP feature importance with rule induction to extract behavioral rules from LLMs, improving on RuleFit by 82% in Mean Reciprocal Rank
  • Researchers successfully injected nonlinear behavioral triggers into GPT and Llama models to establish ground truth for evaluating XAI methods on language models
  • Traditional XAI methods struggle with non-univariate triggers in LLMs, highlighting the need for hybrid approaches combining statistical and symbolic techniques
  • The approach converts semantic LLM outputs to numerical abstractions, enabling application of existing XAI tools designed for numerical data
  • This methodology offers practical pathways for auditing and surfacing belief-driven heuristics that fuel misinformation in production language models