AI · Neutral · arXiv – CS AI · 10h ago · 7/10
🧠
Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning
Researchers identify critical honesty failures in Large Language Model unlearning methods: after attempting to forget harmful training data, models hallucinate or behave inconsistently rather than admitting the gap. They propose ReVa, a representation-alignment procedure that improves honesty by having the model explicitly acknowledge that knowledge has been forgotten instead of confabulating, while maintaining utility on retained information.
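The summary does not spell out ReVa's actual objective, but a representation-alignment loss for unlearning typically combines two terms: pulling hidden states on forget-set inputs toward an "honest refusal" anchor, and keeping hidden states on retain-set inputs close to those of the original model. The PyTorch sketch below is a hypothetical illustration of that general pattern, not ReVa's formulation; the function name, the anchor representation, and the weighting `lam` are all assumptions.

```python
# Hypothetical sketch of a representation-alignment unlearning loss.
# NOT ReVa's actual objective -- the summary above does not specify it;
# the anchor states and the weighting `lam` are illustrative assumptions.
import torch
import torch.nn.functional as F

def alignment_loss(h_forget, h_anchor, h_retain, h_retain_ref, lam=1.0):
    """Pull forget-set hidden states toward an 'I don't know' anchor,
    while keeping retain-set hidden states close to the frozen reference
    model so utility on retained information is preserved."""
    forget_term = F.mse_loss(h_forget, h_anchor)      # align on forgotten data
    retain_term = F.mse_loss(h_retain, h_retain_ref)  # preserve retained behavior
    return forget_term + lam * retain_term

# Toy usage with random activations standing in for model hidden states.
d = 16
h_forget = torch.randn(4, d, requires_grad=True)   # states on forget-set prompts
h_anchor = torch.randn(4, d)                       # honest-refusal anchor states
h_retain = torch.randn(4, d, requires_grad=True)   # states on retain-set prompts
h_retain_ref = torch.randn(4, d)                   # frozen reference-model states

loss = alignment_loss(h_forget, h_anchor, h_retain, h_retain_ref, lam=0.5)
loss.backward()
print(loss.item())
```

Under this reading, the retain term is what distinguishes honest unlearning from simple suppression: the model is steered to refuse on forgotten content without drifting on everything else.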