
Mechanistic origins of catastrophic forgetting: why RL preserves circuits better than SFT?

arXiv – CS AI | Jeanmely Rojas Nunez, Viraj Sawant, Nathan Allen, Nomgondalai Amgalanbaatar, Yannis Zongo, Vasu Sharma, Maheep Chaudhary | May 29, 2026 at 04:00 AM

🤖 AI Summary

Researchers demonstrate that reinforcement learning (RL) preserves internal computational circuits in large language models better than supervised fine-tuning (SFT) during task adaptation. Using a new metric called differential circuit vulnerability on Qwen2.5-3B-Instruct, they reveal a mechanistic trade-off: SFT adapts faster but causes substantial circuit disruption and capability forgetting, while RL maintains base model circuits at the cost of slower learning.

Analysis

This research addresses a fundamental challenge in deploying large language models: the tendency to forget previously learned capabilities when fine-tuned on new tasks. The study bridges behavioral observations with mechanistic explanations by examining what actually happens inside the model during different training approaches.

The distinction between RL and SFT reflects broader differences in optimization philosophy. SFT uses direct gradient updates toward target outputs, creating rapid but destructive changes to the model's internal representations. RL's policy-gradient approach constrains updates to remain closer to the original policy distribution, resulting in more conservative modifications to learned circuits. This mechanistic insight explains empirical findings that RL-trained models retain broader capability profiles.
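The contrast described above can be made concrete with a toy sketch. It is not the paper's experiment or its metric: it assumes a four-way softmax "policy", a hypothetical task in which two outputs are rewarded, an SFT dataset that always demonstrates one of them, and a KL penalty toward the base policy (a common choice in RL fine-tuning, assumed here rather than taken from the paper). All names and hyperparameters are illustrative, and the outcome depends in part on how the toy task is constructed.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def kl(p, q):
    # KL(p || q) for two discrete distributions
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def sft_step(z, target, lr):
    # Gradient ascent on log p(target): d/dz_i = onehot_i - p_i
    p = softmax(z)
    return [zi + lr * ((1.0 if i == target else 0.0) - pi)
            for i, (zi, pi) in enumerate(zip(z, p))]

def rl_step(z, rewards, ref, lr, beta):
    # Exact gradient of E_p[r] - beta * KL(p || ref) w.r.t. the logits:
    # p_i * ((r_i - E[r]) - beta * (log(p_i / ref_i) - KL))
    p = softmax(z)
    exp_r = sum(pi * ri for pi, ri in zip(p, rewards))
    drift = kl(p, ref)
    return [zi + lr * pi * ((ri - exp_r) - beta * (math.log(pi / qi) - drift))
            for zi, pi, ri, qi in zip(z, p, rewards, ref)]

# Hypothetical setup: uniform base policy, outputs 2 and 3 are both rewarded,
# but the SFT data only ever demonstrates output 3.
base_logits = [0.0, 0.0, 0.0, 0.0]
ref = softmax(base_logits)
rewards = [0.0, 0.0, 1.0, 1.0]
target = 3
steps, lr, beta = 20, 1.0, 0.1  # illustrative values only

z_sft, z_rl = list(base_logits), list(base_logits)
for _ in range(steps):
    z_sft = sft_step(z_sft, target, lr)
    z_rl = rl_step(z_rl, rewards, ref, lr, beta)

p_sft, p_rl = softmax(z_sft), softmax(z_rl)
kl_sft, kl_rl = kl(p_sft, ref), kl(p_rl, ref)

print("SFT policy:", [round(x, 3) for x in p_sft], "KL from base:", round(kl_sft, 3))
print("RL  policy:", [round(x, 3) for x in p_rl], "KL from base:", round(kl_rl, 3))
```

In this toy, SFT concentrates probability on the single demonstrated output and suppresses the other rewarded one, while the RL update only shifts mass away from unrewarded outputs, so the RL policy ends up closer to the base distribution. Distance in output distribution is only a behavioral proxy; the paper's argument concerns internal circuits, which this sketch does not model.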

For practitioners developing multi-capability systems, this research has immediate implications. Organizations deploying language models that must maintain both specialized performance and general knowledge face a genuine trade-off. RL adapts to a new task more slowly, which costs additional training time, but it yields models less prone to catastrophic forgetting. This matters for production systems where retraining costs are high and broad capability retention is valuable.

The introduction of differential circuit vulnerability provides a diagnostic tool for evaluating fine-tuning methods at the level of internal circuits rather than output behavior alone. Future research will likely explore hybrid approaches that combine RL's circuit-preservation advantages with faster adaptation mechanisms, or investigate whether selective circuit protection could accelerate RL-based training without sacrificing robustness.

Key Takeaways

→ RL preserves internal computational circuits significantly better than SFT during model fine-tuning, explaining its resistance to catastrophic forgetting.
→ SFT achieves faster task adaptation but causes substantially greater circuit degradation and loss of prior capabilities.
→ Differential circuit vulnerability provides a new mechanistic metric for measuring how much internal model structure degrades during fine-tuning.
→ The trade-off between rapid adaptation and circuit preservation represents a fundamental design choice in training approaches for multi-capability systems.
→ Circuit-level analysis reveals mechanistic explanations for behavioral observations about RL's robustness to catastrophic forgetting.

#large-language-models #reinforcement-learning #catastrophic-forgetting #model-circuits #fine-tuning #supervised-learning #mechanistic-interpretability

