🧠 AI | ⚪ Neutral | Importance 6/10

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

arXiv – CS AI | Haowen Wang, Yaxin Du, Jian Yang, Jiajun Wu, Shukai Liu, Yuxuan Zhang, Pingjie Wang, Siheng Chen, Tuney Zheng, Ming Zhou, Xianglong Liu | May 29, 2026 at 04:00 AM

🤖 AI Summary

Researchers introduce MIRA, a framework for selecting data during the mid-training of large language models by dynamically discovering and applying source-specific evaluation rubrics. In code-oriented experiments on a corpus drawn from 21 diverse data sources, the approach reaches performance comparable to full-corpus training while using 50% fewer tokens.

Analysis

MIRA addresses a bottleneck in modern LLM development: the mid-training phase sits between pretraining and post-training, and its data curation must balance scalability with semantic quality across heterogeneous sources. Traditional approaches either sacrifice semantic rigor for computational efficiency or assume standardized data formats that don't reflect real-world training pipelines. The framework's innovation lies in making rubric construction adaptive rather than fixed: it first discovers which qualities matter for each source group, then distills those rubric-based judgments into lightweight scorers that can filter the full corpus.
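The distillation step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words student, the function names (`fit_student`, `student_score`, `fit_per_group`, `filter_corpus`), and the 0.5 threshold are all assumptions chosen for illustration. The sketch assumes that rubric-based teacher scores in [0, 1] already exist for a small labeled sample from each source group, and fits one cheap scorer per group that can then be run over every document.

```python
from collections import defaultdict


def fit_student(labeled):
    """Fit a toy student scorer from (text, teacher_score) pairs.

    Hypothetical stand-in for a lightweight scorer: each token's weight is
    the mean teacher score of the sampled documents that contain it.
    """
    sums, counts = defaultdict(float), defaultdict(int)
    for text, score in labeled:
        for tok in set(text.lower().split()):
            sums[tok] += score
            counts[tok] += 1
    return {tok: sums[tok] / counts[tok] for tok in sums}


def student_score(weights, text, default=0.5):
    """Score a document as the mean weight of its tokens.

    Unseen tokens fall back to `default` (an assumed neutral value).
    """
    toks = text.lower().split()
    if not toks:
        return default
    return sum(weights.get(t, default) for t in toks) / len(toks)


def fit_per_group(samples):
    """Fit one student per source group: {group: [(text, teacher_score), ...]}."""
    return {group: fit_student(rows) for group, rows in samples.items()}


def filter_corpus(corpus, students, threshold=0.5):
    """Keep (group, text) pairs whose group-specific student score meets the threshold."""
    return [
        (group, text)
        for group, text in corpus
        if student_score(students[group], text) >= threshold
    ]
```

The point of the per-group dictionary is that the same text can be scored differently depending on its source group, which is the source-aware part of the idea; a real student would more plausibly be a small trained classifier than a token average.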

This work reflects the increasing sophistication of LLM training pipelines. As models grow larger and more capable, the marginal value of raw data quantity diminishes relative to strategic data curation. Mid-training has emerged as a leverage point where targeted mixtures of curated data can efficiently strengthen specific capabilities before final alignment. MIRA's source-aware approach acknowledges that code, documents, and domain-specific texts require different evaluation criteria, a practical constraint that generic filtering methods overlook.

The results carry meaningful implications for both efficiency and accessibility in AI development. Reaching comparable performance with 50% fewer tokens translates directly into lower computational cost, energy consumption, and training time. This efficiency gain matters for organizations scaling LLM capabilities without proportional increases in infrastructure investment. The framework also suggests a path toward more systematic, interpretable data selection than black-box model-based approaches offer, potentially enabling a better understanding of which data sources drive specific capabilities.
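To make the 50% figure concrete, here is a minimal sketch of selecting documents under a token budget. It assumes, purely for illustration, that the budget is applied separately to each source and that every document already carries a quality score; the summary does not say how MIRA allocates its budget, and `select_by_budget` and the field names `source`, `tokens`, and `score` are hypothetical.

```python
from collections import defaultdict


def select_by_budget(docs, keep_fraction=0.5):
    """Greedy per-source selection under a token budget.

    docs: list of dicts with keys 'source', 'tokens', and 'score'.
    For each source, walk documents from highest to lowest score and keep
    each one that still fits within keep_fraction of that source's tokens.
    """
    by_source = defaultdict(list)
    for doc in docs:
        by_source[doc["source"]].append(doc)

    kept = []
    for items in by_source.values():
        budget = keep_fraction * sum(d["tokens"] for d in items)
        used = 0
        for doc in sorted(items, key=lambda d: d["score"], reverse=True):
            if used + doc["tokens"] > budget:
                continue  # this document would exceed the source's budget
            kept.append(doc)
            used += doc["tokens"]
    return kept
```

Applying the budget per source, rather than over the pooled corpus, keeps a source with generally lower scores from being dropped entirely. Whether that matches the paper's procedure is not stated in this summary.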

Key Takeaways

→ MIRA enables source-adaptive data selection by discovering evaluation rubrics specific to each data source group rather than applying fixed criteria
→ The framework reaches performance comparable to full-corpus training while using 50% fewer tokens on code benchmarks
→ Self-anchored rubric discovery lets semantic quality signals guide scalable student scorers, balancing computational efficiency with semantic rigor
→ The approach targets the distinct data selection problem of mid-training, where sources have heterogeneous formats and play different training roles
→ Experiments cover 21 data sources and 9 code benchmarks, demonstrating practical effectiveness in complex, real-world training scenarios