🧠 AI | 🔴 Bearish | Importance 7/10

Erased, but Not Gone: Output Forgetting Is Not True Forgetting

arXiv – CS AI | Teresa Pui Yee Yong, Win Kent Ong, Chee Seng Chan | June 25, 2026 at 04:00 AM

🤖 AI Summary

Researchers demonstrate that machine unlearning methods that appear successful at the output level, where unlearning is typically evaluated, still retain structured residual information in representation space when compared with models retrained from scratch. This finding reveals a critical gap between apparent forgetting and genuine forgetting, and suggests that current unlearning evaluations systematically overestimate effectiveness.

Analysis

This research exposes a fundamental disconnect in how machine unlearning is evaluated. Output-level metrics, such as reduced accuracy on the forget set, do not guarantee forgetting at the representational level, where the knowledge may persist in structured patterns. This matters because machine unlearning is increasingly important for privacy compliance, model safety, and copyright concerns in generative AI systems.

The technical core is a comparison of unlearned models against retrained baselines, that is, models trained from scratch without the data to be forgotten. Across multiple datasets and architectures, unlearned models show consistent patterns: forget-set representations only partially align with those of the retrained model, retain-set representations diverge more significantly, and residual information concentrates along specific directions rather than being distributed randomly. This structured mismatch indicates that the models have not truly forgotten the data but have instead learned to obscure it at the output.
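To make the representation-level comparison concrete, here is a minimal sketch in Python with NumPy. It is not the paper's method: the summary does not name the metrics used, so linear CKA stands in as one common similarity measure, and the residual spectrum is one simple way to check whether a mismatch is concentrated along a few directions. The function names and the assumption that both models' features are directly comparable (same layer, same dimensionality, already aligned) are illustrative.

```python
import numpy as np


def linear_cka(X, Y):
    """Linear CKA between two (n_samples, n_features) representation matrices.

    Returns a value in [0, 1]; 1 means the two representations match up to
    rotation and isotropic scaling. Used here as a stand-in similarity
    measure, not as the metric from the paper.
    """
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return cross / (norm_x * norm_y)


def residual_concentration(H_unlearned, H_retrained, k=1):
    """Fraction of residual energy captured by the top-k singular directions.

    Assumes both matrices live in the same feature space (an assumption of
    this sketch). Values near 1 mean the mismatch is concentrated along a
    few directions; values near k / n_features mean it is spread out.
    """
    residual = H_unlearned - H_retrained
    singular_values = np.linalg.svd(residual, compute_uv=False)
    energy = singular_values ** 2
    return energy[:k].sum() / energy.sum()
```

In use, one would extract features for the same inputs from the unlearned and retrained models, then compute both quantities separately on the forget set and the retain set. A gap between the two sets, or a residual dominated by a few directions, would be the kind of structured mismatch the analysis describes.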

For the AI industry, the implications are significant. Organizations relying on unlearning for GDPR compliance or copyright remediation may have false confidence in their methods' effectiveness. Security researchers could potentially recover supposedly forgotten information by analyzing representation space, which creates liability risks. The findings also suggest that recent unlearning papers may overstate their contributions when judged solely by output metrics.

Moving forward, the field must adopt stronger evaluation standards built on retraining-consistent metrics that examine representation-level forgetting. This shift will likely slow near-term claims of breakthrough unlearning methods but should establish more reliable foundations for privacy-preserving AI systems. Researchers should adopt dual evaluation frameworks that combine output-level and representation-level assessments.
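A dual evaluation of this kind could be sketched as follows. This is an illustration under stated assumptions, not a standard from the paper: the `dual_evaluation` function, the `ForgettingReport` type, the verdict labels, and both thresholds are hypothetical, and the representation score can be any similarity to the retrained model in [0, 1] (linear CKA, for example).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ForgettingReport:
    output_gap: float
    representation_similarity: float
    output_consistent: bool
    representation_consistent: bool

    @property
    def verdict(self) -> str:
        # Passing both checks is the only case treated as genuine forgetting.
        if self.output_consistent and self.representation_consistent:
            return "retraining-consistent"
        # Passing only the output check is the failure mode discussed above.
        if self.output_consistent:
            return "apparent forgetting only"
        return "not forgotten"


def dual_evaluation(
    forget_acc_unlearned: float,
    forget_acc_retrained: float,
    representation_similarity: float,
    max_output_gap: float = 0.02,   # hypothetical threshold
    min_similarity: float = 0.9,    # hypothetical threshold
) -> ForgettingReport:
    """Compare an unlearned model with a retrained baseline on two levels."""
    gap = abs(forget_acc_unlearned - forget_acc_retrained)
    return ForgettingReport(
        output_gap=gap,
        representation_similarity=representation_similarity,
        output_consistent=gap <= max_output_gap,
        representation_consistent=representation_similarity >= min_similarity,
    )
```

The design choice worth noting is that the output check is measured against the retrained model's forget-set accuracy rather than against zero, since a model that never saw the data may still classify some of it correctly.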

Key Takeaways

→Standard output-level unlearning metrics systematically overestimate forgetting effectiveness by missing structured residuals in representation space.
→Retrained models serve as the operational ground truth, revealing that current unlearning methods exhibit forget/retain asymmetry and concentrated residual information.
→Machine unlearning evaluated only at the output layer may leave sensitive information recoverable through analysis of representation space.
→Current evaluation practices certify apparent forgetting rather than retraining-consistent forgetting, creating false confidence in privacy-preserving methods.
→The AI research community needs stronger dual-framework evaluation standards that examine both output-level and representation-level consistency.