Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning
Researchers identify critical honesty failures in Large Language Model unlearning methods, where models hallucinate or behave inconsistently after attempting to forget harmful training data. They propose ReVa, a representation-alignment procedure that improves honesty by teaching models to acknowledge forgotten knowledge while maintaining utility on retained information.
LLM unlearning, the ability to selectively remove harmful training data from a language model without degrading its overall performance, is an emerging challenge in AI safety. This research exposes a fundamental tension between forgetting and truthfulness: existing unlearning methods remove the targeted data but introduce dishonest behaviors such as hallucinations and inconsistent responses. The finding that all nine tested methods, spanning three major families, fail honesty standards suggests the field lacks adequate evaluation frameworks for real-world deployment.
The dishonesty problem stems from a deeper issue in how models learn and unlearn representations. When a model is forced to forget specific knowledge, current approaches leave it in an ambiguous state where it sometimes generates plausible-sounding but false information rather than admitting the gap. This mirrors broader reliability challenges in LLMs, which often output incorrect information confidently instead of expressing uncertainty.
ReVa addresses this gap directly: it uses representation alignment to train models to consistently reject or acknowledge queries about forgotten knowledge. The near-doubling of rejection rates after two interaction rounds indicates that models can learn to handle uncertainty about forgotten data more gracefully. The secondary benefit of improved honesty on retained knowledge suggests that aligning representations also stabilizes overall model behavior.
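To make the idea concrete, here is a minimal sketch of what a representation-alignment unlearning objective could look like in PyTorch. The loss form, the mean-pooled last-layer hidden states, and names such as `representation_alignment_loss` and `refusal_batch` are illustrative assumptions rather than ReVa's published implementation: forget-set representations are pulled toward those of an explicit refusal/acknowledgement response, while retain-set representations are kept close to the original model's.

```python
import torch
import torch.nn.functional as F

def pooled_hidden(model, batch, requires_grad=True):
    """Mean-pooled last-layer hidden states for a batch of tokenized prompts
    (assumes a Hugging-Face-style model that accepts output_hidden_states=True)."""
    ctx = torch.enable_grad() if requires_grad else torch.no_grad()
    with ctx:
        out = model(**batch, output_hidden_states=True)
        return out.hidden_states[-1].mean(dim=1)

def representation_alignment_loss(model, ref_model, forget_batch, retain_batch,
                                  refusal_batch, lam=1.0):
    """Illustrative unlearning objective (assumed form, not ReVa's actual code).

    forget_term: pull forget-set representations toward an honest refusal anchor,
                 so the model acknowledges the gap instead of hallucinating.
    retain_term: keep retain-set representations close to the frozen reference
                 model, preserving utility on knowledge that should survive.
    """
    # Assumes forget_batch and refusal_batch are paired (same batch size).
    h_forget = pooled_hidden(model, forget_batch)
    h_refusal = pooled_hidden(ref_model, refusal_batch, requires_grad=False)
    h_retain = pooled_hidden(model, retain_batch)
    h_retain_ref = pooled_hidden(ref_model, retain_batch, requires_grad=False)

    # Cosine distance: 0 when aligned, 2 when opposed.
    forget_term = 1.0 - F.cosine_similarity(h_forget, h_refusal, dim=-1).mean()
    retain_term = 1.0 - F.cosine_similarity(h_retain, h_retain_ref, dim=-1).mean()
    return forget_term + lam * retain_term
```

The key design point this sketch tries to capture is that the forget set is anchored to a state the model can truthfully verbalize (a refusal or acknowledgement) rather than to noise, which is what ties the forgetting objective to honesty instead of mere suppression.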
This work carries implications for AI governance and trustworthiness standards. As regulations increasingly demand the ability to remove data from trained models, organizations cannot rely on forgetting-effectiveness metrics alone. Future deployment of unlearned models must incorporate honesty evaluation into compliance procedures, potentially requiring standardized benchmarks like the one this research proposes.
- All nine tested LLM unlearning methods, spanning three major families, fail to maintain honesty while removing knowledge, producing hallucinations and inconsistent responses.
- ReVa's representation-alignment approach nearly doubles rejection rates for forgotten knowledge while also improving honesty on retained data.
- Dishonesty in unlearning occurs when models generate false information rather than admitting they have forgotten something.
- Current unlearning evaluation metrics lack adequate measures of model honesty and transparency about knowledge limitations (a toy honesty scorer is sketched after this list).
- Proper unlearning with honesty standards may become a regulatory requirement as AI governance frameworks develop.
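As a rough illustration of the evaluation gap noted above, the sketch below shows what a simple honesty scorer for unlearned models might compute over forget-set prompts. The keyword-based refusal detection and the name `honesty_metrics` are assumptions made for illustration; a standardized benchmark would rely on stronger judges, such as human annotators or an LLM grader.

```python
from typing import Dict, List

# Heuristic refusal markers (an assumption; a real benchmark would use a judge model).
REFUSAL_MARKERS = ("i don't know", "i cannot recall", "i no longer have", "i'm not able to")

def honesty_metrics(responses: List[str], gold_answers: List[str]) -> Dict[str, float]:
    """Toy honesty scorer over forget-set prompts.

    rejection_rate:     fraction of prompts the model explicitly declines to answer.
    hallucination_rate: fraction where the model neither refuses nor matches the
                        gold answer, i.e. it confidently produces something false.
    """
    def refused(response: str) -> bool:
        return any(marker in response.lower() for marker in REFUSAL_MARKERS)

    n = len(responses)
    rejections = sum(refused(r) for r in responses)
    hallucinations = sum(
        not refused(r) and gold.lower() not in r.lower()
        for r, gold in zip(responses, gold_answers)
    )
    return {"rejection_rate": rejections / n, "hallucination_rate": hallucinations / n}
```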