y0news
🧠 AI · 🟢 Bullish · Importance 7/10

Rewriting Pre-Training Data Boosts LLM Performance in Math and Code

arXiv – CS AI | Kazuki Fujii, Yukito Tajima, Sakae Mizuki, Masaki Kawamura, Hinari Shimada, Taihei Shiotani, Koshiro Saito, Masanari Oi, Taishi Nakamura, Takumi Okamoto, Shigeki Ishida, Kakeru Hattori, Youmi Ma, Hiroya Takamura, Rio Yokota, Jun Sakuma, Naoaki Okazaki
🤖 AI Summary

Researchers released two open-source datasets, SwallowCode and SwallowMath, that significantly improve large language model performance on coding and mathematics through systematic data rewriting rather than filtering. Continual pre-training on these datasets boosts Llama-3.1-8B by +17.0 points on HumanEval for coding and +12.4 points on GSM8K for math.

Key Takeaways
  • The SwallowCode dataset contains 16.1 billion tokens of refined Python code produced by a four-stage rewriting pipeline.
  • The SwallowMath dataset includes 2.3 billion tokens of enhanced mathematical solutions with step-by-step explanations.
  • The transform-and-retain approach outperforms traditional filtering methods by refining low-quality data instead of discarding it.
  • Continual pre-training with these datasets yields substantial performance gains across multiple benchmarks.
  • All datasets, code, and methodologies are released under open licenses for reproducibility and adaptation.
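The contrast between filtering and transform-and-retain can be sketched in a few lines. This is a toy illustration only: the `score` and `rewrite` functions below are hypothetical stand-ins, not the paper's actual LLM-based rewriting stages, which the summary describes only at a high level.

```python
# Toy contrast: filtering discards low-quality samples, while
# transform-and-retain rewrites them and keeps every sample.

def filter_pipeline(samples, score, threshold):
    # Traditional approach: drop anything below the quality threshold,
    # shrinking the training corpus.
    return [s for s in samples if score(s) >= threshold]

def transform_and_retain(samples, score, threshold, rewrite):
    # Transform-and-retain: low-quality samples are refined and kept,
    # so no training tokens are thrown away.
    return [s if score(s) >= threshold else rewrite(s) for s in samples]

if __name__ == "__main__":
    corpus = ["def add(a, b):\n    return a + b\n", "x=1;y=2;print( x+y )", ""]
    score = lambda s: len(s.strip())      # hypothetical quality proxy
    rewrite = lambda s: s.strip() + "\n"  # hypothetical refinement step

    kept = filter_pipeline(corpus, score, threshold=5)
    refined = transform_and_retain(corpus, score, threshold=5, rewrite=rewrite)
    print(len(kept), len(refined))  # the rewritten corpus keeps all samples
```

The design point is that every sample survives the second pipeline, which matches the summary's claim that refining beats discarding when pre-training data is scarce or expensive.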