Repeated Post-Training Is Not Self-Improving: Diagnosing Scientific Amnesia in Continual DPO Pipelines
Researchers identify 'scientific amnesia' as a critical failure mode in continual DPO (Direct Preference Optimization) training pipelines, in which LLMs preserve learned behaviors but fail to accumulate reusable methodological knowledge across sequential training campaigns. Testing five strategy proposers on a 30-campaign benchmark shows that four of them degrade performance, with only conservative rule-based scheduling showing consistent improvement.
The study addresses a practical problem faced by industrial LLM teams: repeatedly fine-tuning models on preference data often fails to produce cumulative improvements despite preserving previous capabilities. This differs from catastrophic forgetting—the model doesn't lose old knowledge, but rather struggles to apply learned training principles to new domains. The researchers formalize this intuition through diagnostic tools and test it against production-like conditions using Qwen2.5-7B-Instruct across 30 HumanEval campaigns.
The research reveals a sobering reality: four of five candidate solutions, including a meta-scientific approach called MSCL, degraded performance during continued training. Only the deliberately conservative rule-based scheduler proved reliable. This reflects a fundamental challenge in scaling LLM training: the methodological knowledge needed to optimize one campaign doesn't automatically transfer to the next, even when the domains overlap.
For AI development teams, the findings suggest that naive continuation of DPO pipelines may be counterproductive. The sharp dependence on evaluation design, chain composition, and random seed coverage indicates that improvements are fragile and context-dependent. Organizations pursuing multi-campaign training must adopt defensive strategies—like conservative scheduling—rather than relying on sophisticated memory or optimization techniques that currently offer unreliable gains.
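One way to picture a defensive strategy of this kind is a seed-aware acceptance gate: a proposed change to the training recipe is carried into the next campaign only if it beats the current recipe on every evaluated seed and by a minimum average margin. The sketch below is illustrative only and is not the scheduler from the study; the function name `accept_change`, the thresholds, and the example scores are all hypothetical.

```python
from statistics import mean


def accept_change(baseline_scores, candidate_scores, min_gain=0.01, min_win_rate=1.0):
    """Hypothetical conservative gate (not the study's scheduler).

    Both arguments are per-seed evaluation scores, paired by seed.
    Returns True only if the candidate's mean gain reaches `min_gain`
    and it beats the baseline on at least `min_win_rate` of the seeds.
    """
    if len(baseline_scores) != len(candidate_scores) or not baseline_scores:
        raise ValueError("need paired, non-empty per-seed scores")

    # Per-seed improvement of the candidate recipe over the current one.
    deltas = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    win_rate = sum(d > 0 for d in deltas) / len(deltas)

    return mean(deltas) >= min_gain and win_rate >= min_win_rate


# Made-up scores for three seeds, purely for illustration.
consistent = accept_change([0.60, 0.62, 0.61], [0.63, 0.64, 0.63])
fragile = accept_change([0.60, 0.62, 0.61], [0.66, 0.61, 0.62])
print(consistent, fragile)
```

With these made-up numbers the first candidate is accepted and the second is rejected: its average gain is larger, but it loses on one seed. Requiring wins on every seed is deliberately strict, and loosening `min_win_rate` trades robustness for faster iteration.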
The work raises the question of why continual learning fails at the methodological level for LLMs. Future research should explore whether architectural changes, alternative optimization algorithms, or fundamentally different training paradigms can solve scientific amnesia at scale.
- Scientific amnesia, the failure to accumulate training knowledge across campaigns, emerges as a problem distinct from catastrophic forgetting in continual DPO pipelines.
- Most advanced memory and optimization strategies underperformed simple rule-based scheduling in the studied production-like regime.
- Results are highly sensitive to evaluation design, training chain composition, and random seeds, limiting the generalizability of solutions.
- Industrial LLM teams may need to adopt conservative training strategies rather than sophisticated continual learning approaches.
- The problem has been diagnosed rather than solved, which leaves a significant open challenge for scaling multi-campaign LLM training.