AI · Neutral · arXiv – CS AI · 7h ago · 7/10
Automatically Finding and Validating Unexpected Side-Effects of Interventions on Language Models
Researchers present an automated pipeline for auditing how the behavior of large language models changes when interventions are applied. The method generates human-readable hypotheses about differences between models and validates them statistically. Across real-world interventions such as knowledge editing and unlearning, it identifies both intended effects and unexpected side effects.