🧠 AI | ⚪ Neutral | Importance 7/10

Natural Ungrokking: Asymmetric Control of Which Rules Survive Pretraining

arXiv – CS AI | Juliana Li, Diya Sreedhar | June 25, 2026 at 04:00 AM

🤖 AI Summary

Researchers found that language models can forget learned rules midway through training even though the data continues to support them, a phenomenon they call 'natural ungrokking.' Whether a rule survives depends predictably on how often it appears in the training data. Data manipulation can reliably destroy a rule but fails to restore one that has been forgotten, revealing asymmetric control over model knowledge.

Analysis

This research exposes an instability in language model training: a learned behavior can reverse without warning, disappearing from the model even though the evidence for it remains in the training corpus. The finding challenges the assumption that neural networks accumulate knowledge monotonically. During pretraining, models learn generalizable rules such as pronoun-gender agreement, but these rules can collapse when competing patterns in the data outcompete them, even though the original rule's supporting evidence persists unchanged.

The mechanism appears largely deterministic: survival is predicted by support frequency, the ratio of rule-confirming examples to total training instances. The pattern holds consistently across multiple model scales and datasets, suggesting it reflects something fundamental about how neural networks balance competing signals during optimization.

The asymmetry in control is particularly striking. Researchers can reliably kill a rule by reducing its support frequency, but they cannot resurrect a forgotten rule even by flooding the training data with 450x the normal level of confirmation. This one-way behavior suggests that forgetting is not simple decay but active displacement: a different pattern has captured the model's capacity, and recovering the old rule requires overcoming that entrenchment.

For AI safety and interpretability, the implication is that model behavior remains sensitive to training data composition in ways that standard metrics do not reveal: knowledge can look stable in the loss curve while being fragile in the learned representations. Understanding which rules survive pretraining bears directly on model reliability and predictability in deployment, particularly for safety-critical applications where rule stability cannot be assumed.
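To make the support-frequency definition above concrete, here is a minimal Python sketch that counts rule-confirming examples over total examples. The function name `support_frequency`, the toy corpus, and the crude detector `mentions_agreeing_pronoun` are all hypothetical illustrations; the paper's actual measurement procedure is not described in this summary and may differ.

```python
from typing import Callable, Iterable


def support_frequency(examples: Iterable[str],
                      confirms_rule: Callable[[str], bool]) -> float:
    """Fraction of training examples that confirm the rule (confirming / total)."""
    total = 0
    confirming = 0
    for example in examples:
        total += 1
        if confirms_rule(example):
            confirming += 1
    if total == 0:
        raise ValueError("cannot compute support frequency of an empty corpus")
    return confirming / total


# Toy corpus (made up for illustration, not data from the paper).
corpus = [
    "The queen said she would attend.",
    "The king said he would attend.",
    "It rained all day.",
    "The weather was mild.",
]


def mentions_agreeing_pronoun(sentence: str) -> bool:
    """Deliberately crude, hypothetical detector for pronoun-gender agreement."""
    words = sentence.lower().replace(".", "").split()
    return ("queen" in words and "she" in words) or ("king" in words and "he" in words)


freq = support_frequency(corpus, mentions_agreeing_pronoun)
print(f"support frequency: {freq:.2f}")
```

In this toy corpus two of the four sentences confirm the rule, so the ratio is 0.5. A real corpus would need a far more careful detector, and the choice of what counts as a "confirming" example is itself an assumption.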

Key Takeaways

→ Language models can forget learned rules mid-training despite continued evidence in the data; the collapse is driven by competing surface patterns rather than data scarcity.
→ Rule survival depends predictably on support frequency (the ratio of confirming to total training examples) across different model scales and datasets.
→ Forgotten rules cannot be restored through data manipulation, even at 450x natural support levels, although reducing support reliably destroys them; control over learned knowledge is asymmetric.
→ Collapse dynamics appear consistent across public model checkpoints, suggesting that ungrokking reflects fundamental training dynamics rather than the quirks of a particular model or dataset.
→ Loss curves mask internal knowledge instability, indicating that loss alone does not detect rule fragility during pretraining.
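Because the takeaways above indicate that loss curves do not show rule collapse, one option is to track a rule-specific accuracy at each checkpoint and flag a learned-then-lost pattern. The sketch below assumes such per-checkpoint accuracies are already available. The function `find_collapse`, both thresholds, and the numbers are hypothetical and are not taken from the paper.

```python
from typing import Optional, Sequence


def find_collapse(accuracy: Sequence[float],
                  learned: float = 0.9,
                  forgotten: float = 0.6) -> Optional[int]:
    """Return the index of the first checkpoint where a learned rule has collapsed.

    A rule counts as learned once accuracy reaches `learned`. It counts as
    collapsed at the first later checkpoint where accuracy is below `forgotten`.
    Returns None if the rule was never learned or never collapsed.
    Both thresholds are illustrative assumptions.
    """
    was_learned = False
    for i, acc in enumerate(accuracy):
        if acc >= learned:
            was_learned = True
        elif was_learned and acc < forgotten:
            return i
    return None


# Made-up per-checkpoint rule accuracies, not results from the paper.
rule_accuracy = [0.50, 0.72, 0.95, 0.97, 0.81, 0.55, 0.52]
collapse_at = find_collapse(rule_accuracy)
print(f"collapse detected at checkpoint index: {collapse_at}")
```

Using two thresholds rather than one avoids flagging a rule that merely hovers around a single cutoff; the specific values would need to be chosen per rule and per evaluation set.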