AI | Bullish | Importance 7/10

SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

arXiv – CS AI | Ashima Suvarna, Kendrick Phan, Mehrab Beikzadeh, Hritik Bansal, Saadia Gabriel
🤖AI Summary

SUPERNOVA introduces a framework for extending reinforcement learning with verifiable rewards (RLVR) beyond STEM fields by systematically curating data from natural instruction datasets. Smaller models trained on the resulting 25K-instance dataset gain 64.4 percentage points on complex reasoning benchmarks, and the improvements generalize across model scales and families.
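To make "verifiable rewards" concrete, here is a minimal sketch of a binary reward that checks a model's final answer against a reference label. It is an illustration of the general RLVR idea, not the paper's implementation: the `Answer:` marker, the normalization rules, and all function names are assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical output convention for this sketch; not taken from the paper.
ANSWER_MARKER = "Answer:"


def extract_answer(completion: str) -> Optional[str]:
    """Return the text after the last answer marker, or None if it is absent."""
    idx = completion.rfind(ANSWER_MARKER)
    if idx == -1:
        return None
    return completion[idx + len(ANSWER_MARKER):].strip()


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing sentence punctuation."""
    text = re.sub(r"\s+", " ", text.strip().lower())
    return text.rstrip(".!?")


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if normalize(answer) == normalize(reference) else 0.0
```

Human-annotated instruction data fits this pattern when each instance has a short reference answer that can be compared automatically, which is what lets it serve as a training signal outside mathematics and code.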

Analysis

SUPERNOVA addresses a critical limitation in AI reasoning research: reinforcement learning has driven significant progress in mathematics and code, but extending these techniques to general reasoning remains constrained by a scarcity of verifiable training data. This work demonstrates that natural instruction datasets, which are existing repositories of human-annotated examples, can be systematically mined for RLVR training through careful curation.

The authors ran over 100 controlled experiments investigating three data design dimensions: source task selection, task mixing, and synthetic interventions. The results show that selecting source tasks for each target domain individually substantially outperforms selecting them by average performance across domains, while synthetic data augmentation provides minimal benefit. The resulting SUPERNOVA dataset improves Qwen3-0.6B performance by 64.4 percentage points on BIG-Bench Extra Hard (BBEH), a benchmark of 23 complex reasoning tasks spanning diverse domains. These gains transfer to larger models, newer model families, and unseen evaluation sets, which indicates genuine reasoning improvement rather than overfitting.

For the AI development community, this work provides actionable guidance on dataset curation methodology. By showing that human-annotated instruction data, which is already widely available, can effectively train general reasoning capabilities, SUPERNOVA lowers the practical barrier to extending RLVR beyond mathematics and programming and makes reasoning improvements more accessible across model scales. The generalization findings suggest that strategic data selection matters more than raw data volume or synthetic augmentation, shifting the focus toward careful curation. The public release of models, data, and code may accelerate adoption of these techniques across the industry.
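The difference between per-target task selection and average-based selection can be sketched with a small table of gains. The numbers, task names, and function names below are purely illustrative assumptions, not results or code from the paper; the sketch only shows why ranking source tasks by their mean gain can pick a different task than ranking them for one specific target.

```python
from statistics import mean

# Illustrative numbers only (NOT from the paper): hypothetical score change on
# each target domain after training on a single source task.
gains = {
    "task_a": {"logic": 5.0, "language": 0.0},
    "task_b": {"logic": 0.0, "language": 5.0},
    "task_c": {"logic": 3.0, "language": 3.0},
}


def select_by_average(gains, k):
    """Pick the k source tasks with the highest mean gain across all targets."""
    ranked = sorted(gains, key=lambda t: mean(gains[t].values()), reverse=True)
    return ranked[:k]


def select_per_target(gains, target, k):
    """Pick the k source tasks with the highest gain on one specific target."""
    ranked = sorted(gains, key=lambda t: gains[t][target], reverse=True)
    return ranked[:k]


print(select_by_average(gains, 1))              # ['task_c']
print(select_per_target(gains, "logic", 1))     # ['task_a']
print(select_per_target(gains, "language", 1))  # ['task_b']
```

In this toy setup the average-based choice is a generalist task that is second-best on every target, while per-target selection finds the specialist task for each domain.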

Key Takeaways
  • SUPERNOVA curates 25K RLVR training instances from natural instruction datasets, achieving 64.4pp gains on complex reasoning benchmarks
  • Task selection based on individual target domain performance significantly outperforms strategies using overall average performance metrics
  • Synthetic data interventions do not improve reasoning performance, indicating that quality human annotation drives RLVR success
  • Reasoning improvements generalize across unseen benchmarks, larger model scales, and newer model families beyond the training setup
  • The framework provides practical methodology for extending reinforcement learning approaches from STEM to general reasoning domains