
#data-curation News & Analysis

10 articles tagged with #data-curation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Pioneer Agent: Continual Improvement of Small Language Models in Production

Researchers introduce Pioneer Agent, an automated system that continuously improves small language models in production by diagnosing failures, curating training data, and retraining under regression constraints. The system shows consistent gains across benchmarks, and a real-world deployment improved intent-classification accuracy from 84.9% to 99.3%.
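
The paper's core loop (diagnose, curate, retrain, gate on regressions) can be illustrated with a minimal sketch. Everything below is a toy stand-in, not the authors' code: the "model" is a lookup table, and the curation budget and regression tolerance are made-up parameters.

```python
# Minimal sketch of a diagnose -> curate -> retrain -> regression-gate loop,
# in the spirit of Pioneer Agent. All names and thresholds are illustrative.
import random

def diagnose(model, traffic):
    """Collect production examples the current model gets wrong."""
    return [(x, y) for x, y in traffic if model(x) != y]

def curate(failures, budget=100):
    """Keep a bounded, deduplicated slice of failures for retraining."""
    unique = list({x: (x, y) for x, y in failures}.values())
    random.shuffle(unique)
    return unique[:budget]

def retrain(model_table, new_data):
    """'Training' stub: memorize curated labels in a lookup table."""
    updated = dict(model_table)
    updated.update({x: y for x, y in new_data})
    return updated

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

def deploy_if_no_regression(old_table, new_table, holdout, tol=0.0):
    """Regression gate: only ship if holdout accuracy does not drop."""
    old = accuracy(lambda x: old_table.get(x, 0), holdout)
    new = accuracy(lambda x: new_table.get(x, 0), holdout)
    return new_table if new + tol >= old else old_table

# Toy traffic: intent ids 0/1 keyed by integer "utterances".
random.seed(0)
truth = {i: i % 2 for i in range(1000)}
table = {i: truth[i] for i in range(0, 1000, 3)}   # partially trained model
holdout = [(i, truth[i]) for i in range(0, 1000, 7)]

for step in range(5):
    xs = [random.randrange(1000) for _ in range(200)]
    traffic = [(x, truth[x]) for x in xs]
    fails = diagnose(lambda x: table.get(x, 0), traffic)
    candidate = retrain(table, curate(fails))
    table = deploy_if_no_regression(table, candidate, holdout)
    print(f"step {step}: holdout acc = "
          f"{accuracy(lambda x: table.get(x, 0), holdout):.3f}")
```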

AI · Neutral · arXiv – CS AI · 2d ago · 7/10

What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

Researchers introduce WIMHF, a method that uses sparse autoencoders to decode what human feedback datasets actually measure about model preferences. The technique identifies interpretable features across 7 datasets, revealing diverse preference patterns and uncovering potentially unsafe biases, such as LMArena users voting against safety refusals, while enabling targeted data curation that improved safety by 37%.
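
The sparse-autoencoder idea can be sketched on synthetic data: learn a small dictionary of sparse, non-negative features over (chosen − rejected) embedding differences. The dimensions, L1 weight, and synthetic data below are illustrative assumptions, not the paper's setup.

```python
# Minimal sparse-autoencoder sketch in the spirit of WIMHF.
import torch

torch.manual_seed(0)
d_embed, d_feat, n_pairs = 64, 16, 2048

# Synthetic preference data: each pair difference is a sparse mix of a
# few ground-truth "preference directions" plus noise.
true_dirs = torch.randn(d_feat, d_embed)
codes = (torch.rand(n_pairs, d_feat) < 0.1).float() * torch.rand(n_pairs, d_feat)
diffs = codes @ true_dirs + 0.01 * torch.randn(n_pairs, d_embed)

enc = torch.nn.Linear(d_embed, d_feat)
dec = torch.nn.Linear(d_feat, d_embed, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)

for step in range(500):
    z = torch.relu(enc(diffs))          # non-negative sparse codes
    recon = dec(z)
    loss = ((recon - diffs) ** 2).mean() + 1e-3 * z.abs().mean()  # L1 sparsity
    opt.zero_grad(); loss.backward(); opt.step()

# Each decoder row is a candidate "preference feature"; inspecting which
# pairs activate it is the (manual) interpretation step.
z = torch.relu(enc(diffs))
print("mean active features per pair:", (z > 1e-3).float().sum(1).mean().item())
```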

AI · Bullish · arXiv – CS AI · 6d ago · 7/10

Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

Researchers introduce SAVANT, a model-agnostic framework that improves Vision Language Models' ability to detect semantic anomalies in autonomous-driving scenes by 18.5% by replacing ad hoc prompting with structured reasoning. The team used the framework to label 10,000 real-world images and fine-tuned an open-source 7B model to 90.8% recall, showing that practical deployment is feasible without depending on proprietary models.
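
The contrast with ad hoc prompting is easiest to see as staged queries whose outputs are aggregated into a verdict. The sketch below is hypothetical: `query_vlm` is a stand-in for any real VLM client, and the stage prompts are illustrative, not SAVANT's exact ones.

```python
# Sketch of structured (staged) reasoning for semantic anomaly detection.
# `query_vlm` and its canned answers are placeholders for a real model call.

def query_vlm(image, prompt: str) -> str:
    """Hypothetical VLM call; swap in a real client in practice."""
    canned = {
        "describe": "A pedestrian is pushing a shopping cart on the highway.",
        "actors": "pedestrian; shopping cart; surrounding cars",
        "expectation": "Pedestrians are not expected on a highway roadway.",
        "verdict": "ANOMALY",
    }
    return canned[prompt.split(":", 1)[0]]

def detect_semantic_anomaly(image) -> dict:
    """Run fixed reasoning stages and aggregate into a final verdict."""
    scene = query_vlm(image, "describe: What is happening in this scene?")
    actors = query_vlm(image, "actors: List the traffic participants.")
    check = query_vlm(image, "expectation: Does any actor violate normal "
                             "driving-domain expectations?")
    verdict = query_vlm(image, "verdict: Answer ANOMALY or NORMAL.")
    return {"scene": scene, "actors": actors,
            "violation": check, "is_anomaly": verdict == "ANOMALY"}

print(detect_semantic_anomaly(image=None))
```

Forcing the model through the same named stages for every image is what makes the outputs comparable enough to use as labels for fine-tuning.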

AI · Bullish · arXiv – CS AI · Mar 17 · 7/10

Data Darwinism Part II: DataEvolve -- AI can Autonomously Evolve Pretraining Data Curation

Researchers introduce DataEvolve, an AI framework that autonomously evolves data-curation strategies for pretraining datasets through iterative optimization. The system processed 672B tokens to create the Darwin-CC dataset, which outperformed existing datasets such as DCLM and FineWeb-Edu when training 3B-parameter models.
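
"Evolving a curation strategy" can be pictured as select-and-mutate over filter configurations scored by a fitness proxy. In this sketch the strategy is just a pair of quality/length thresholds and the fitness function is a toy; both are illustrative assumptions, not the paper's method.

```python
# Sketch of evolving a data-curation strategy, DataEvolve-style.
import random

random.seed(0)
docs = [{"quality": random.random(), "length": random.randint(10, 5000)}
        for _ in range(10_000)]

def fitness(strategy):
    """Toy proxy: reward keeping high-quality, mid-length documents."""
    kept = [d for d in docs
            if d["quality"] >= strategy["min_quality"]
            and strategy["min_len"] <= d["length"] <= strategy["max_len"]]
    if not kept:
        return 0.0
    avg_q = sum(d["quality"] for d in kept) / len(kept)
    return avg_q * min(1.0, len(kept) / 2000)   # quality x sufficient volume

def mutate(s):
    return {"min_quality": min(0.99, max(0.0, s["min_quality"] + random.gauss(0, 0.05))),
            "min_len": max(1, int(s["min_len"] * random.uniform(0.8, 1.25))),
            "max_len": max(100, int(s["max_len"] * random.uniform(0.8, 1.25)))}

population = [{"min_quality": 0.5, "min_len": 50, "max_len": 4000}] * 8
for generation in range(20):
    population = sorted(population, key=fitness, reverse=True)[:4]   # select
    population += [mutate(p) for p in population]                     # vary
print("best strategy:", max(population, key=fitness))
```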

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Researchers introduce DataChef-32B, an AI system that uses reinforcement learning to automatically generate optimal data processing recipes for training large language models. The system eliminates the need for manual data curation by automatically designing complete data pipelines, achieving performance comparable to human experts across six benchmark tasks.
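
At its simplest, "RL over data recipes" means treating candidate pipelines as actions and a downstream evaluation score as the reward. The sketch below uses an epsilon-greedy bandit over orderings of three pipeline ops with a noisy toy reward; the real system generates full pipelines and trains models to score them, so everything here is an illustrative assumption.

```python
# Sketch of choosing a data-processing recipe via RL, DataChef-style.
import itertools, random

random.seed(0)
OPS = ["dedup", "quality_filter", "rewrite"]
recipes = list(itertools.permutations(OPS))   # candidate pipelines

def run_pipeline_and_eval(recipe) -> float:
    """Toy stand-in for 'apply recipe, train, measure benchmark score'."""
    base = {"dedup": 0.30, "quality_filter": 0.25, "rewrite": 0.15}
    score = sum(base[op] * (0.9 ** i) for i, op in enumerate(recipe))
    return score + random.gauss(0, 0.02)       # noisy evaluation

# Epsilon-greedy bandit over recipes.
value = {r: 0.0 for r in recipes}
count = {r: 0 for r in recipes}
for t in range(300):
    r = (random.choice(recipes) if random.random() < 0.1
         else max(recipes, key=lambda k: value[k]))
    reward = run_pipeline_and_eval(r)
    count[r] += 1
    value[r] += (reward - value[r]) / count[r]  # incremental mean

print("best recipe found:", max(recipes, key=lambda k: value[k]))
```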

AI · Bullish · arXiv – CS AI · Mar 5 · 7/10

Phi-4-reasoning-vision-15B Technical Report

Researchers released Phi-4-reasoning-vision-15B, a compact open-weight multimodal AI model that combines vision and language capabilities with strong performance in scientific and mathematical reasoning. The model demonstrates that careful architecture design and high-quality data curation can let smaller models achieve competitive performance at lower computational cost.

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Researchers introduce a multi-agent framework that maps data lineage in large language models, revealing how post-training datasets evolve and interconnect. The analysis uncovers structural redundancy and the propagation of benchmark contamination, and proposes lineage-aware dataset construction to improve the diversity and quality of LLM training data.
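
Lineage analysis boils down to a directed graph of datasets in which flags like "contains benchmark test data" propagate along derivation edges. A minimal sketch, with entirely hypothetical dataset names and edges:

```python
# Sketch of lineage-aware dataset analysis: edges point from a source
# dataset to one derived from it; contamination propagates downstream.
from collections import defaultdict

derives = defaultdict(list)          # parent -> children
for parent, child in [("WebCorpusA", "InstructMixV1"),
                      ("WebCorpusA", "InstructMixV2"),
                      ("InstructMixV1", "ChatBlend"),
                      ("BenchmarkX-test", "InstructMixV2")]:  # contamination!
    derives[parent].append(child)

def descendants(node):
    """All datasets transitively derived from `node` (DFS)."""
    seen, stack = set(), [node]
    while stack:
        for child in derives[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# Any dataset downstream of benchmark test data inherits contamination.
print("contaminated:", descendants("BenchmarkX-test"))
# Redundancy check: datasets sharing a common ancestor overlap in content.
print("derived from WebCorpusA:", descendants("WebCorpusA"))
```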

AI · Bullish · arXiv – CS AI · 2d ago · 6/10

Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

Researchers present Data Mixing Agent, an AI framework that uses reinforcement learning to automatically optimize how large language models balance training data from source and target domains during continual pre-training. The approach outperforms manual reweighting strategies while generalizing across different models, domains, and fields without requiring retraining.
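 
One way to picture learned domain reweighting: nudge mixture weights toward domains whose extra sampling most reduces target-domain loss, keeping the weights a valid distribution. The quadratic loss model standing in for actual training runs below is a toy assumption, not the paper's agent.

```python
# Sketch of domain reweighting for continual pre-training, in the spirit
# of Data Mixing Agent: exponentiated-gradient steps on mixture weights.
import math

domains = ["source_web", "source_code", "target_math"]
weights = {d: 1.0 / len(domains) for d in domains}

def target_loss(w):
    """Toy proxy: target loss improves with target data but still needs
    some source data to avoid forgetting (hence the penalty terms)."""
    return ((1.0 - w["target_math"]) ** 2
            + 0.3 * (0.2 - w["source_web"]) ** 2
            + 0.3 * (0.1 - w["source_code"]) ** 2)

lr, eps = 0.5, 1e-3
for step in range(50):
    base = target_loss(weights)
    for d in domains:                      # finite-difference "reward signal"
        probe = dict(weights)
        probe[d] += eps
        grad = (target_loss(probe) - base) / eps
        weights[d] *= math.exp(-lr * grad) # exponentiated gradient step
    total = sum(weights.values())          # renormalize to a distribution
    weights = {d: w / total for d, w in weights.items()}

print({d: round(w, 3) for d, w in weights.items()})
```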

AI · Neutral · arXiv – CS AI · 2d ago · 6/10

Can Small Training Runs Reliably Guide Data Curation? Rethinking Proxy-Model Practice

Researchers demonstrate that small-scale proxy models commonly used by AI companies to evaluate data curation strategies produce unreliable conclusions because optimal training configurations are data-dependent. They propose using reduced learning rates in proxy model training as a simple, cost-effective solution that better predicts full-scale model performance across diverse data recipes.
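
The failure mode and the fix can both be simulated: rank candidate recipes with a proxy whose optimal learning rate shifts with the data, then check how well each proxy ranking matches the data's true quality ordering. The loss model below is a toy built to exhibit exactly that data dependence, not the paper's experiments.

```python
# Sketch: proxy-model rankings of data recipes under default vs. reduced LR.
import random

random.seed(0)
recipes = [{"name": f"recipe_{i}", "true_quality": random.random(),
            "lr_sensitivity": random.uniform(0.5, 2.0)} for i in range(8)]

def proxy_loss(recipe, lr):
    """Toy proxy-run loss: quality helps; a too-hot LR hurts unevenly
    across recipes because the optimal LR is data-dependent."""
    return (1.0 - recipe["true_quality"]) + recipe["lr_sensitivity"] * lr ** 2

def ranking(lr):
    return [r["name"] for r in sorted(recipes, key=lambda r: proxy_loss(r, lr))]

true_rank = [r["name"] for r in sorted(recipes,
                                       key=lambda r: -r["true_quality"])]

def rank_agreement(a, b):
    """Fraction of recipe pairs ordered the same way in both rankings."""
    pos_a = {n: i for i, n in enumerate(a)}
    pos_b = {n: i for i, n in enumerate(b)}
    pairs = [(x, y) for i, x in enumerate(a) for y in a[i + 1:]]
    same = sum((pos_a[x] < pos_a[y]) == (pos_b[x] < pos_b[y]) for x, y in pairs)
    return same / len(pairs)

print("agreement, default LR :", rank_agreement(ranking(lr=1.0), true_rank))
print("agreement, reduced LR :", rank_agreement(ranking(lr=0.1), true_rank))
```

With the hot learning rate the LR-sensitivity term dominates and scrambles the ranking; with the reduced rate the proxy ranks recipes almost purely by data quality, which is the paper's proposed fix in miniature.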

🏒 Meta
AI · Bullish · arXiv – CS AI · Mar 2 · 7/10

MobileLLM-R1: Exploring the Limits of Sub-Billion Language Model Reasoners with Open Training Recipes

Researchers developed MobileLLM-R1, a sub-billion-parameter AI model that demonstrates strong reasoning using only 2T tokens of high-quality data rather than massive 10T+ token corpora. The 950M-parameter model outperforms larger competitors on reasoning benchmarks while using only 11.7% of the training tokens of models such as Qwen3.