Zipping the Thought: When and How Compressed Reasoning Data Works in LLM Post-Training
Researchers propose a taxonomy of chain-of-thought (CoT) reasoning in LLM post-training, distinguishing between explicit, composed, and implicit reasoning formats. The study reveals that compressed reasoning data requires different training approaches, with composed CoT benefiting from data scaling while implicit CoT risks memorization, and that reinforcement learning can decompose compressed steps learned during supervised fine-tuning.
This arXiv paper addresses a fundamental tension in LLM development: how to achieve strong reasoning performance while managing computational costs. The study systematically categorizes chain-of-thought reasoning compression and measures its effects on model training, providing empirical evidence that additional post-training data does not help every reasoning format equally.
The research builds on growing recognition that LLM reasoning requires careful engineering of training data. As models tackle increasingly complex problems, the length of intermediate reasoning steps creates significant token overhead during inference. This work moves beyond anecdotal observations to establish a framework showing that two compression strategies, combining steps and omitting them entirely, create fundamentally different learning dynamics.
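To make the distinction between combining steps and omitting them concrete, the sketch below renders one toy problem in the three formats of the taxonomy. The task (a chain of arithmetic operations), the text layout, and the names `trace` and `build_target` are assumptions chosen for illustration; they are not the paper's actual tasks or data format.

```python
from typing import List, Optional, Tuple

# Hypothetical step encoding: ("+", 3) means "add 3", ("*", 2) means "multiply by 2".
Op = Tuple[str, int]


def trace(start: int, ops: List[Op]) -> List[int]:
    """Return the value before the first step and after every step."""
    values = [start]
    for sym, arg in ops:
        prev = values[-1]
        values.append(prev + arg if sym == "+" else prev * arg)
    return values


def build_target(start: int, ops: List[Op], group: Optional[int]) -> str:
    """Render a training target for one problem.

    group=1    -> explicit CoT: every intermediate result is written out.
    group=k>1  -> composed CoT: k consecutive steps are merged into one line.
    group=None -> implicit CoT: no intermediate lines, only the final answer.
    """
    values = trace(start, ops)
    lines = []
    if group is not None:
        for i in range(0, len(ops), group):
            chunk = ops[i:i + group]
            end = min(i + group, len(ops))
            applied = " ".join(f"{sym}{arg}" for sym, arg in chunk)
            lines.append(f"{values[i]} {applied} -> {values[end]}")
    lines.append(f"answer: {values[-1]}")
    return "\n".join(lines)


if __name__ == "__main__":
    ops = [("+", 3), ("*", 2), ("+", 1), ("*", 3)]
    for name, group in [("explicit", 1), ("composed", 2), ("implicit", None)]:
        print(f"[{name}]")
        print(build_target(2, ops, group))
```

In this toy setup, a larger `group` value yields a coarser target: fewer tokens per example, but each line asks the model to perform more computation without showing it.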
The findings have direct implications for AI labs and commercial LLM providers balancing performance against inference costs. Organizations can tailor their post-training approach to the data budget available: composed CoT improves with additional data, while implicit CoT's memorization risk suggests it may only suit specialized fine-tuning scenarios. The observation that reinforcement learning decomposes compressed reasoning steps points to a potential synergy between the SFT and RL phases that practitioners could exploit.
The research also reveals subtle differences in model generalization based on CoT ordering, with unidirectional presentations supporting longer sequential tasks better. This suggests reasoning format may matter as much as reasoning quality. Going forward, practitioners should monitor whether commercial models incorporate these design principles, and researchers should explore whether these findings extend to multi-step reasoning beyond synthetic tasks.
- Coarser compressed reasoning requires substantially more supervised fine-tuning data to achieve equivalent performance.
- Composed CoT benefits from data scaling while implicit CoT exhibits memorization patterns, requiring different training strategies.
- Reinforcement learning actively decomposes compressed reasoning steps learned during supervised fine-tuning, suggesting complementary training phases.
- Unidirectional chain-of-thought ordering demonstrates stronger generalization on longer sequential reasoning tasks.
- Data resource constraints require tailored CoT design decisions based on specific compression granularity and available training data.