AI · Bullish · Importance: 7/10
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
arXiv – CS AI | Kazuki Fujii, Yukito Tajima, Sakae Mizuki, Masaki Kawamura, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Masanari Oi, Taishi Nakamura, Takumi Okamoto, Shigeki Ishida, Kakeru Hattori, Youmi Ma, Hiroya Takamura, Rio Yokota, Jun Sakuma, Naoaki Okazaki
AI Summary
Researchers released two open-source datasets, SwallowCode and SwallowMath, that significantly improve large language model performance in coding and mathematics through systematic data rewriting rather than filtering. The datasets boost Llama-3.1-8B performance by 17.0 points on HumanEval for coding and 12.4 points on GSM8K for math.
Key Takeaways
- SwallowCode contains 16.1 billion tokens of refined Python code produced by a four-stage rewriting pipeline.
- SwallowMath contains 2.3 billion tokens of enhanced mathematical solutions with step-by-step explanations.
- The transform-and-retain approach outperforms traditional filtering methods by refining low-quality data instead of discarding it (see the sketch after this list).
- Continual pre-training with these datasets yields substantial performance gains across multiple benchmarks.
- All datasets, code, and methodologies are released under open licenses for reproducibility and adaptation.
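The takeaways above describe the pipeline only at a high level. The snippet below is a minimal sketch of the transform-and-retain idea in Python; the stage boundaries, the 7.0 score threshold, and the `style_score` and `llm_rewrite` helpers are illustrative assumptions, not the authors' released tooling.

```python
# Illustrative sketch of a transform-and-retain data pipeline.
# Stages loosely mirror the takeaways above: syntax validation, a style
# check, and LLM rewriting passes. `style_score` and `llm_rewrite` are
# hypothetical placeholders, not the paper's actual implementation.
import ast


def is_valid_python(source: str) -> bool:
    """Stage 1: keep only snippets that parse as valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def style_score(source: str) -> float:
    """Stage 2 placeholder: a linter-style quality score in [0, 10]."""
    # A real pipeline could wrap a linter such as pylint here.
    return 10.0 if source.strip() else 0.0


def llm_rewrite(source: str, instruction: str) -> str:
    """Stages 3-4 placeholder: an LLM call that rewrites the snippet."""
    # A real pipeline would prompt a language model with the instruction.
    return source


def transform_and_retain(snippets: list[str]) -> list[str]:
    """Refine low-quality snippets instead of discarding them outright."""
    refined = []
    for code in snippets:
        if not is_valid_python(code):
            continue  # unparseable code is the only hard drop
        if style_score(code) < 7.0:  # assumed threshold, for illustration
            # Rather than filtering the snippet out, rewrite it.
            code = llm_rewrite(code, "Rewrite to follow the style guide.")
        code = llm_rewrite(code, "Make the snippet self-contained and optimized.")
        refined.append(code)
    return refined


if __name__ == "__main__":
    print(transform_and_retain(["def add(a,b): return a+b", "def broken(:"]))
```

The point of the sketch is the control flow: syntactically invalid samples are dropped, but everything else is kept and improved, which is how rewriting can grow usable training tokens where filtering would shrink them.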
#llm #pre-training #datasets #code-generation #mathematical-reasoning #open-source #llama #data-quality #ai-performance
Read the original via arXiv – CS AI