🧠 AI · Neutral · Importance: 7/10

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

arXiv – CS AI | Qihan Ren, Peng Wang, Ruikun Cai, Shuai Shao, Dadi Guo, Yuejin Xie, Yafu Li, Quanshi Zhang, Xia Hu, Jing Shao, Dongrui Liu

🤖 AI Summary

Researchers challenge the conventional wisdom that supervised finetuning (SFT) merely memorizes while reinforcement learning generalizes. Their analysis reveals that reasoning SFT with chain-of-thought supervision can generalize across domains, but success depends critically on optimization duration, data quality, and base model strength, with generalization improvements coming at the cost of degraded safety performance.

Analysis

The study addresses a fundamental debate in large language model development: whether post-training approaches enable genuine learning or shallow pattern replication. The researchers demonstrate that cross-domain generalization in reasoning tasks is achievable through SFT, contradicting the prevailing narrative that exclusively credits reinforcement learning with this capability. However, this finding comes with important caveats that reshape how practitioners should approach model training.

The research identifies three critical conditions determining generalization success. First, optimization dynamics matter significantly—models exhibit a dip-and-recovery pattern where performance temporarily declines before improving, meaning premature checkpoint selection can incorrectly suggest generalization failure. Second, data quality directly influences transferability; low-quality training data broadly undermines cross-domain performance while verified long chain-of-thought traces consistently improve it. Third, model capability acts as a foundation—stronger base models internalize abstract procedural patterns like backtracking that transfer across domains, while weaker models merely replicate surface-level patterns without deep understanding.
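The dip-and-recovery dynamic implies that checkpoint selection should tolerate a temporary decline in cross-domain scores rather than stopping at the first drop. A minimal sketch in Python, assuming a list of per-checkpoint evaluation scores; the function name and `patience` threshold are illustrative, not from the paper:

```python
def select_checkpoint(eval_history, patience=5):
    """Return the index of the best checkpoint, refusing to stop
    until the score has failed to improve for `patience` evals."""
    best_idx, best_score, stale = 0, float("-inf"), 0
    for i, score in enumerate(eval_history):
        if score > best_score:
            best_idx, best_score, stale = i, score, 0
        else:
            stale += 1
            if stale >= patience:  # plateau long enough to trust
                break
    return best_idx

# An invented dip-and-recovery curve: performance falls after
# checkpoint 1, then recovers and peaks at checkpoint 6.
scores = [0.52, 0.55, 0.48, 0.45, 0.50, 0.58, 0.63, 0.62, 0.61]
print(select_checkpoint(scores))             # patient: picks 6
print(select_checkpoint(scores, patience=3)) # impatient: stops in the dip, picks 1
```

With a short patience window, selection halts inside the dip and reports checkpoint 1 as "best", exactly the premature-failure reading the paper warns about; a longer window rides out the dip and finds the recovered peak.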

The asymmetric nature of this generalization presents a critical tradeoff. Improvements in reasoning capability come paired with degraded safety performance, fundamentally reframing the optimization problem from binary success-or-failure to managed capability-safety dynamics. This has direct implications for practitioners developing reasoning systems, suggesting that scaling SFT effectiveness requires simultaneous attention to data engineering, extended training schedules, and model selection rather than relying solely on reinforcement learning approaches. The findings also highlight that claimed generalization failures in prior research may reflect incomplete training rather than inherent memorization properties of SFT.
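One way to operationalize the capability-safety tradeoff is to select checkpoints on a blended score with a hard safety floor, rather than on reasoning accuracy alone. A hypothetical sketch, assuming each checkpoint reports a reasoning score and a safety score; the weight, floor, and numbers are invented for illustration:

```python
def combined_score(reasoning, safety, safety_weight=0.5, safety_floor=0.80):
    """Reject checkpoints whose safety falls below a hard floor;
    otherwise blend reasoning and safety into one selection score."""
    if safety < safety_floor:
        return float("-inf")  # unacceptable safety regression
    return (1 - safety_weight) * reasoning + safety_weight * safety

# Invented checkpoint metrics illustrating the tradeoff.
checkpoints = [
    {"reasoning": 0.55, "safety": 0.92},  # safest, weakest reasoning
    {"reasoning": 0.63, "safety": 0.86},  # best reasoning, mild safety drop
    {"reasoning": 0.61, "safety": 0.78},  # below the floor: rejected outright
]
best = max(checkpoints,
           key=lambda c: combined_score(c["reasoning"], c["safety"]))
print(best)  # the checkpoint balancing both metrics above the floor
```

Selecting on reasoning alone would pick a checkpoint the safety floor rejects; making the tradeoff explicit in the selection criterion is one way to manage the capability-safety dynamics the paper describes.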

Key Takeaways
  • Cross-domain generalization in reasoning SFT is conditional and achievable, not categorically impossible as often claimed.
  • Dip-and-recovery training patterns mean early stopping can incorrectly suggest generalization failure when extended training would succeed.
  • Verified, long chain-of-thought training traces substantially outperform low-quality solutions in enabling transferable learning.
  • Stronger base models learn abstract procedural patterns while weaker models remain bound to surface-level imitation.
  • Reasoning improvement and safety degradation occur together, requiring intentional tradeoff management rather than treating them independently.