Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability
Researchers challenge the conventional wisdom that supervised finetuning (SFT) merely memorizes while reinforcement learning generalizes. Their analysis reveals that reasoning SFT with chain-of-thought supervision can generalize across domains, but success depends critically on optimization duration, data quality, and base model strength, with generalization improvements coming at the cost of degraded safety performance.
The study addresses a fundamental debate in large language model development: whether post-training enables genuine learning or merely shallow pattern replication. The researchers demonstrate that cross-domain generalization in reasoning tasks is achievable through SFT, contradicting the prevailing narrative that credits this capability exclusively to reinforcement learning. However, the finding comes with important caveats that reshape how practitioners should approach model training.
The research identifies three critical conditions determining generalization success. First, optimization dynamics matter significantly—models exhibit a dip-and-recovery pattern where performance temporarily declines before improving, meaning premature checkpoint selection can incorrectly suggest generalization failure. Second, data quality directly influences transferability; low-quality training data broadly undermines cross-domain performance while verified long chain-of-thought traces consistently improve it. Third, model capability acts as a foundation—stronger base models internalize abstract procedural patterns like backtracking that transfer across domains, while weaker models merely replicate surface-level patterns without deep understanding.
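The dip-and-recovery dynamic has a practical consequence for checkpoint selection. As a minimal, purely illustrative sketch (the synthetic validation curve and the `best_checkpoint` helper below are hypothetical, not from the study), a short early-stopping patience can lock in a checkpoint from inside the dip, while a longer patience survives it and reaches the recovered plateau:

```python
# Hypothetical illustration of why short-patience early stopping
# misreads a dip-and-recovery validation curve. All numbers are synthetic.

def best_checkpoint(scores, patience):
    """Return (step, score) of the best checkpoint seen before early
    stopping triggers after `patience` evaluations without improvement."""
    best_step, best_score = 0, scores[0]
    stale = 0
    for step, score in enumerate(scores[1:], start=1):
        if score > best_score:
            best_step, best_score = step, score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break  # early stopping fires here
    return best_step, best_score

# Synthetic cross-domain accuracy: early gain, a temporary dip,
# then recovery to a higher plateau (the pattern described above).
curve = [0.30, 0.34, 0.33, 0.31, 0.29, 0.32, 0.38, 0.43, 0.45]

print(best_checkpoint(curve, patience=2))  # stops inside the dip
print(best_checkpoint(curve, patience=5))  # survives the dip and recovers
```

With `patience=2` the procedure halts during the dip and reports the pre-dip checkpoint as the best achievable, exactly the kind of premature reading the study warns can masquerade as generalization failure; with `patience=5` it reaches the later, stronger checkpoint.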
The asymmetric nature of this generalization presents a critical tradeoff. Improvements in reasoning capability come paired with degraded safety performance, fundamentally reframing the optimization problem from binary success-or-failure to managed capability-safety dynamics. This has direct implications for practitioners developing reasoning systems, suggesting that scaling SFT effectiveness requires simultaneous attention to data engineering, extended training schedules, and model selection rather than relying solely on reinforcement learning approaches. The findings also highlight that claimed generalization failures in prior research may reflect incomplete training rather than inherent memorization properties of SFT.
- Cross-domain generalization in reasoning SFT is conditional and achievable, not categorically impossible as often claimed.
- Dip-and-recovery training patterns mean early stopping can incorrectly suggest generalization failure when extended training would succeed.
- Verified long-form reasoning traces substantially outperform low-quality solutions in enabling transferable learning.
- Stronger base models learn abstract procedural patterns while weaker models remain bound to surface-level imitation.
- Reasoning improvement and safety degradation occur together, requiring intentional tradeoff management rather than treating them independently.