Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?
Researchers demonstrate that reinforcement learning (RL) preserves internal computational circuits in large language models better than supervised fine-tuning (SFT) during task adaptation. Using a new metric called differential circuit vulnerability on Qwen2.5-3B-Instruct, they reveal a mechanistic trade-off: SFT adapts faster but causes substantial circuit disruption and capability forgetting, while RL maintains base model circuits at the cost of slower learning.
This research addresses a fundamental challenge in deploying large language models: the tendency to forget previously learned capabilities when fine-tuned on new tasks. The study bridges behavioral observations with mechanistic explanations by examining what actually happens inside the model during different training approaches.
The distinction between RL and SFT reflects broader differences in optimization philosophy. SFT uses direct gradient updates toward target outputs, creating rapid but destructive changes to the model's internal representations. RL's policy-gradient approach constrains updates to remain closer to the original policy distribution, resulting in more conservative modifications to learned circuits. This mechanistic insight explains empirical findings that RL-trained models retain broader capability profiles.
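To make the contrast concrete, here is a minimal Python sketch on a toy four-token vocabulary. It is an illustration under stated assumptions, not the study's training code: the summary does not say which RL algorithm or regularizer was used, so the explicit KL penalty toward the base policy, the coefficient `beta`, and all function names here are assumptions.

```python
import math

def softmax(logits):
    """Convert raw logits into a categorical probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two categorical distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sft_loss(logits, target):
    """SFT objective: cross-entropy toward one fixed target token."""
    return -math.log(softmax(logits)[target])

def rl_loss(logits, ref_logits, action, advantage, beta=0.1):
    """REINFORCE-style surrogate for one sampled action, plus a KL penalty
    that pulls the policy back toward the reference (base) policy.
    ASSUMPTION: the KL term and beta are illustrative, not from the study."""
    probs = softmax(logits)
    ref_probs = softmax(ref_logits)
    pg_term = -advantage * math.log(probs[action])
    return pg_term + beta * kl_divergence(probs, ref_probs)

# Toy logits (made up for illustration, not measured values).
base = [2.0, 1.0, 0.5, 0.0]      # base-model logits over 4 tokens
drifted = [5.0, 0.0, 0.0, 0.0]   # logits after moving far from the base

print("SFT loss toward token 0:", sft_loss(base, 0))
print("KL(drifted || base):", kl_divergence(softmax(drifted), softmax(base)))
print("RL loss at base:", rl_loss(base, base, 0, 1.0))
print("RL loss after drift:", rl_loss(drifted, base, 0, 1.0))
```

The SFT loss has no term that refers to the base model, so nothing in the objective itself resists drift. In this sketch the RL loss pays a penalty proportional to `beta` for every unit of divergence from the base policy; with `beta = 0` it reduces to the plain policy-gradient surrogate.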
For practitioners developing multi-capability systems, this research has immediate implications. Organizations deploying language models that must maintain both specialized performance and general knowledge face a genuine trade-off between adaptation speed and capability retention. RL adapts to the new task more slowly, and so needs more training time and compute, but it yields models less prone to catastrophic forgetting. This matters for production systems where retraining costs are high and broad capability retention is valuable.
Differential circuit vulnerability provides a diagnostic tool for evaluating fine-tuning methods at the level of internal circuits rather than behavior alone. Future research could explore hybrid approaches that combine RL's circuit-preservation advantages with faster adaptation mechanisms, or investigate whether selective circuit protection could accelerate RL-based training without sacrificing robustness.
- RL preserves internal computational circuits significantly better than SFT during model fine-tuning, explaining its resistance to catastrophic forgetting.
- SFT achieves faster task adaptation but causes substantially greater circuit degradation and loss of prior capabilities.
- Differential circuit vulnerability provides a new mechanistic metric for measuring how much internal model structure degrades during fine-tuning.
- The trade-off between rapid adaptation and circuit preservation represents a fundamental design choice in training approaches for multi-capability systems.
- Circuit-level analysis reveals mechanistic explanations for behavioral observations about RL's robustness to catastrophic forgetting.