RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs
Researchers present a framework for analyzing how reinforcement learning (RL) and supervised fine-tuning (SFT) shape reasoning differently in large language models. The study, which covers models from 1.5B to 14B parameters, reports that RL compresses incorrect reasoning paths while SFT expands correct ones, which helps explain why two-stage training produces stronger reasoning capabilities.
This research looks at the mechanics of reasoning LLM training by moving beyond simple accuracy metrics to examine structural changes in how models work through reasoning tasks. Rather than treating RL and SFT as interchangeable components, the analysis indicates that they operate through fundamentally different mechanisms: RL acts as a pruning mechanism that eliminates flawed reasoning pathways, while SFT acts as an expansion mechanism that strengthens and diversifies correct reasoning approaches.
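One way to picture the pruning-versus-expansion contrast is to count how many distinct correct and incorrect reasoning paths appear in a set of sampled responses. The sketch below is only an illustration: the function `count_distinct_paths`, the representation of a path as a tuple of step labels, and all of the sample data are hypothetical and are not taken from the study.

```python
def count_distinct_paths(samples):
    """Return (distinct correct paths, distinct incorrect paths).

    `samples` is an iterable of (path, is_correct) pairs, where `path` is a
    tuple of step labels. This representation is an assumption made for
    illustration only.
    """
    correct = {path for path, ok in samples if ok}
    incorrect = {path for path, ok in samples if not ok}
    return len(correct), len(incorrect)


# Made-up samples for a single problem, four responses per model.
base = [
    (("a", "b", "d"), True),
    (("a", "c", "d"), False),
    (("a", "c", "e"), False),
    (("a", "b", "d"), True),
]
after_rl = [  # fewer distinct incorrect paths than the base model
    (("a", "b", "d"), True),
    (("a", "b", "d"), True),
    (("a", "c", "d"), False),
    (("a", "b", "d"), True),
]
after_sft = [  # more distinct correct paths than the base model
    (("a", "b", "d"), True),
    (("a", "f", "d"), True),
    (("a", "c", "d"), False),
    (("a", "c", "e"), False),
]

print(count_distinct_paths(base))       # (1, 2)
print(count_distinct_paths(after_rl))   # (1, 1)
print(count_distinct_paths(after_sft))  # (2, 2)
```

In this toy data the RL-style samples keep the same single correct path but lose one incorrect path, while the SFT-style samples gain a second correct path. Real analyses would need some way of deciding when two responses follow the same path, which this sketch sidesteps by using exact tuple equality.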
The step-level findings are particularly revealing. RL steepens the decay rates of reasoning step distributions by 2.5x, concentrating functionality into fewer critical steps, while SFT flattens these rates to one-third, spreading reasoning across more steps. This suggests RL creates efficient but narrow reasoning pathways, whereas SFT develops more redundant but more robust ones. The research also offers an explanation for why common practice favors SFT-then-RL sequencing: SFT first establishes diverse correct reasoning patterns, and RL then optimizes which patterns are most efficient.
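To make "decay rate" concrete, the following sketch fits an exponential decay to a ranked list of step counts using a least-squares line through the log counts. It assumes an exponential form, which the summary above does not specify, and the function name `fit_decay_rate` and the synthetic counts are hypothetical. The synthetic data is constructed so that the ratios match the reported 2.5x and one-third figures; it is not data from the study.

```python
import math


def fit_decay_rate(counts):
    """Estimate lam in counts[rank] ~ A * exp(-lam * rank).

    Counts are sorted in descending order, zeros are dropped, and a
    least-squares line is fitted to log(count) against rank. The exponential
    form is an assumption made for illustration.
    """
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    n = len(ranked)
    if n < 2:
        raise ValueError("need at least two positive counts")
    xs = list(range(n))
    ys = [math.log(c) for c in ranked]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # negated slope, so a steeper decay gives a larger value


# Synthetic counts, built to mirror the reported ratios.
base_rate = 0.2
base_counts = [1000 * math.exp(-base_rate * r) for r in range(20)]
rl_counts = [1000 * math.exp(-2.5 * base_rate * r) for r in range(20)]
sft_counts = [1000 * math.exp(-(base_rate / 3) * r) for r in range(20)]

fitted_base = fit_decay_rate(base_counts)
print(fit_decay_rate(rl_counts) / fitted_base)   # close to 2.5 by construction
print(fit_decay_rate(sft_counts) / fitted_base)  # close to 1/3 by construction
```

A larger fitted rate means that a small number of top-ranked steps account for most of the mass, which matches the "fewer critical steps" reading for RL. A smaller rate means the mass is spread over many steps, as described for SFT.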
For the AI development community, this framework enables more targeted optimization of reasoning systems. Data engineers can design training sets that exploit these complementary effects rather than treating RL and SFT as independent improvements. The findings suggest that current two-stage approaches may not be optimal: hybrid methods or intermediate stages could accelerate the development of reasoning capabilities. The graph topology analysis also provides metrics for evaluating reasoning quality beyond traditional benchmarks, helping researchers diagnose where models fail and why.
- RL compresses reasoning by concentrating functionality into fewer critical steps, while SFT expands it across diverse pathways, which explains their complementary training effects.
- RL increases step distribution decay rates by 2.5x while SFT reduces them to one-third, revealing opposite structural impacts on reasoning graph topology.
- The framework offers an explanation for why the SFT-then-RL training order works better than alternatives: diverse correct paths are established first, then optimized.
- Graph-based reasoning analysis provides new metrics beyond accuracy for diagnosing and improving LLM reasoning capabilities.
- The findings suggest potential for hybrid training approaches or new intermediate stages to further accelerate reasoning model development.