AI · Bullish · Importance: 7/10
Rewriting Pre-Training Data Boosts LLM Performance in Math and Code
arXiv – CS AI | Kazuki Fujii, Yukito Tajima, Sakae Mizuki, Masaki Kawamura, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Masanari Oi, Taishi Nakamura, Takumi Okamoto, Shigeki Ishida, Kakeru Hattori, Youmi Ma, Hiroya Takamura, Rio Yokota, Jun Sakuma, Naoaki Okazaki
AI Summary
Researchers released two open-source datasets, SwallowCode and SwallowMath, that significantly improve large language model performance in coding and mathematics through systematic data rewriting rather than filtering. The datasets boost Llama-3.1-8B performance by 17.0 points on HumanEval for coding and 12.4 points on GSM8K for math.
Key Takeaways
- SwallowCode contains 16.1 billion tokens of refined Python code produced by a four-stage rewriting pipeline.
- SwallowMath contains 2.3 billion tokens of enhanced mathematical solutions with step-by-step explanations.
- The transform-and-retain approach outperforms traditional filtering methods by refining low-quality data instead of discarding it (see the sketch after this list).
- Continual pre-training with these datasets yields substantial performance gains across multiple benchmarks.
- All datasets, code, and methodologies are released under open licenses for reproducibility and adaptation.
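The takeaways above describe the pipeline only at a high level. The snippet below is a minimal sketch of the transform-and-retain idea in Python; the stage boundaries, the 7.0 score threshold, and the `style_score` and `llm_rewrite` helpers are illustrative assumptions, not the authors' released tooling.

```python
# Illustrative sketch of a transform-and-retain data pipeline.
# Stages loosely mirror the takeaways above: syntax validation, a style
# check, and LLM rewriting passes. `style_score` and `llm_rewrite` are
# hypothetical placeholders, not the paper's actual implementation.
import ast


def is_valid_python(source: str) -> bool:
    """Stage 1: keep only snippets that parse as valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def style_score(source: str) -> float:
    """Stage 2 placeholder: a linter-style quality score in [0, 10]."""
    # A real pipeline could wrap a linter such as pylint here.
    return 10.0 if source.strip() else 0.0


def llm_rewrite(source: str, instruction: str) -> str:
    """Stages 3-4 placeholder: an LLM call that rewrites the snippet."""
    # A real pipeline would prompt a language model with the instruction.
    return source


def transform_and_retain(snippets: list[str]) -> list[str]:
    """Refine low-quality snippets instead of discarding them outright."""
    refined = []
    for code in snippets:
        if not is_valid_python(code):
            continue  # unparseable code is the only hard drop
        if style_score(code) < 7.0:  # assumed threshold, for illustration
            # Rather than filtering the snippet out, rewrite it.
            code = llm_rewrite(code, "Rewrite to follow the style guide.")
        code = llm_rewrite(code, "Make the snippet self-contained and optimized.")
        refined.append(code)
    return refined


if __name__ == "__main__":
    print(transform_and_retain(["def add(a,b): return a+b", "def broken(:"]))
```

The point of the sketch is the control flow: syntactically invalid samples are dropped, but everything else is kept and improved, which is how rewriting can grow usable training tokens where filtering would shrink them.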
#llm #pre-training #datasets #code-generation #mathematical-reasoning #open-source #llama #data-quality #ai-performance
Read the original via arXiv – CS AI