#data-generation News & Analysis

12 articles tagged with #data-generation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.


AI · Bullish · arXiv – CS AI · Jun 25 · 7/10

Autodata: An agentic data scientist to create high quality synthetic data

Autodata is a method in which AI agents act as data scientists, autonomously generating high-quality synthetic training and evaluation data. The approach, implemented as Agentic Self-Instruct, outperforms traditional synthetic data creation methods on computer science, legal reasoning, and mathematical reasoning tasks, with further gains from meta-optimizing the data scientist agent itself.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

HumanoidMimicGen: Data Generation for Loco-Manipulation via Whole-Body Planning

Researchers introduce HumanoidMimicGen, a method for automatically generating training data for humanoid robots performing complex locomotion and manipulation tasks. The approach enables imitation learning at scale without labor-intensive teleoperation, achieving 20% performance improvements over models trained solely on real-world demonstrations.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Solving Physics Olympiad via Reinforcement Learning on Physics Simulators

Researchers demonstrate that physics simulators can generate synthetic training data for large language models, enabling them to learn physical reasoning without relying on scarce internet QA pairs. Models trained on simulated data show 5-10 percentage point improvements on International Physics Olympiad problems, suggesting simulators offer a scalable alternative for domain-specific AI training.

AI · Bullish · arXiv – CS AI · Mar 5 · 6/10

JANUS: Structured Bidirectional Generation for Guaranteed Constraints and Analytical Uncertainty

Researchers introduce JANUS, a new AI framework that solves the 'Quadrilemma' in synthetic data generation by achieving high fidelity, logical constraint control, reliable uncertainty estimation, and computational efficiency simultaneously. The system uses Bayesian Decision Trees and a novel Reverse-Topological Back-filling algorithm to guarantee 100% constraint satisfaction while being 128x faster than existing methods.

AI · Bullish · Hugging Face Blog · Mar 20 · 7/10

Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models

Cosmopedia is a methodology for generating large-scale synthetic data for pre-training Large Language Models. It addresses the difficulty of obtaining enough high-quality training data by creating synthetic datasets that can supplement or replace traditional web-scraped content.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

Make LLM Learn to Synthesize from Streaming Experiences through Feedback

Researchers introduce StreamSynth, a new framework enabling large language models to learn and improve synthetic data generation across sequential tasks by accumulating experience and transferring knowledge between related synthesis problems. The SynLearner framework demonstrates that LLMs can leverage historical task insights to enhance future data generation quality, establishing synthetic data creation as an experience-driven process rather than isolated operations.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling

GenesisFunc presents an automated pipeline for generating high-quality synthetic training data for LLM function-calling capabilities, addressing limitations in existing data generation methods. The approach uses a multi-agent framework to create diverse, validated datasets that enable smaller LLMs (8B parameters) to match or exceed the function-calling performance of larger proprietary models.
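To make the idea of "validated" function-calling data concrete, here is a minimal sketch of one check such a pipeline might run on each generated sample. It is an illustration only, not GenesisFunc's actual pipeline: the tool name `get_weather`, its parameters, and the `validate_call` helper are all hypothetical.

```python
import json

# Hypothetical tool specification; the name, parameters, and types are
# assumptions made for this example, not taken from the paper.
TOOL_SPEC = {
    "name": "get_weather",
    "parameters": {
        "city": {"type": str, "required": True},
        "unit": {"type": str, "required": False},
    },
}


def validate_call(raw_call, spec):
    """Return a list of validation errors; an empty list means the sample passes."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(call, dict):
        return ["call is not a JSON object"]

    errors = []
    if call.get("name") != spec["name"]:
        errors.append("unknown function name")

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return errors + ["arguments is not an object"]

    params = spec["parameters"]
    # Check every declared parameter for presence and type.
    for key, rule in params.items():
        if key not in args:
            if rule["required"]:
                errors.append(f"missing required argument: {key}")
        elif not isinstance(args[key], rule["type"]):
            errors.append(f"wrong type for argument: {key}")
    # Reject arguments the tool does not declare.
    for key in args:
        if key not in params:
            errors.append(f"unexpected argument: {key}")
    return errors
```

A generated sample would be kept only when `validate_call` returns an empty list. A real pipeline would likely add semantic checks on top of this structural one.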

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

Researchers propose a mid-training technique using self-generated data to improve reinforcement learning in large language models. By exposing models to multiple problem-solving approaches before RL training, the method demonstrates consistent improvements across mathematical reasoning, code generation, and narrative tasks.

AI · Neutral · arXiv – CS AI · Mar 17 · 6/10

Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows

Researchers introduce the Infinite Problem Generator (IPG), an AI framework that creates verifiable physics problems using executable Python code instead of probabilistic text generation. The authors release ClassicalMechanicsV1, a dataset of 1,335 physics problems, and use it to show that code complexity can serve as a precise measure of problem difficulty when training large language models.
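As a rough illustration of what "verifiable problems from executable code" can mean, the sketch below samples parameters for a projectile problem, computes the answer by running a formula, and re-derives it a second way as a check. This is not IPG's actual workflow: the problem type, parameter ranges, and function names are assumptions chosen for the example.

```python
import math
import random

G = 9.8  # gravitational acceleration in m/s^2


def projectile_range(speed, angle_deg, g=G):
    """Closed-form range on level ground: R = v^2 * sin(2*theta) / g."""
    return speed ** 2 * math.sin(math.radians(2 * angle_deg)) / g


def make_problem(rng):
    """Sample parameters and compute the answer by running code."""
    speed = rng.randint(10, 40)                    # m/s, hypothetical range
    angle_deg = rng.choice([15, 30, 45, 60, 75])   # degrees, hypothetical set
    question = (
        f"A ball is launched at {speed} m/s at {angle_deg} degrees above "
        f"level ground. Taking g = {G} m/s^2 and ignoring air resistance, "
        "how far away does it land?"
    )
    return {
        "question": question,
        "speed": speed,
        "angle_deg": angle_deg,
        "answer_m": round(projectile_range(speed, angle_deg), 2),
    }


def verify(problem, tol=0.01):
    """Re-derive the answer a second way: horizontal speed * time of flight."""
    theta = math.radians(problem["angle_deg"])
    t_flight = 2 * problem["speed"] * math.sin(theta) / G
    recomputed = problem["speed"] * math.cos(theta) * t_flight
    return abs(recomputed - problem["answer_m"]) <= tol


if __name__ == "__main__":
    rng = random.Random(0)
    problem = make_problem(rng)
    print(problem["question"])
    print(problem["answer_m"], verify(problem))
```

Because the answer comes from executed code rather than generated text, every problem can be checked mechanically, and the length or structure of the solver is one plausible proxy for difficulty.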

AI · Neutral · arXiv – CS AI · Mar 4 · 4/10

Interaction Field Matching: Overcoming Limitations of Electrostatic Models

Researchers propose Interaction Field Matching (IFM), a generalization of Electrostatic Field Matching that uses physics-inspired interaction fields for data generation and transfer. The method addresses modeling challenges in neural networks by drawing inspiration from quark interactions in physics.

AI · Neutral · arXiv – CS AI · Feb 27 · 4/10

TabDLM: Free-Form Tabular Data Generation via Joint Numerical-Language Diffusion

Researchers introduce TabDLM, a new AI framework that generates synthetic tabular data containing both numerical values and free-form text using joint numerical-language diffusion models. The approach addresses limitations of existing diffusion and LLM-based methods by combining masked diffusion for text with continuous diffusion for numbers, enabling better synthetic data generation for privacy and data augmentation applications.
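The pairing of masked diffusion for text with continuous diffusion for numbers can be pictured through the forward (noising) step alone. The sketch below corrupts one numeric cell with Gaussian noise and one text cell by random masking at a shared noise level `t`. It is a generic illustration, not TabDLM's implementation: the linear schedule, the variance-preserving form, the `[MASK]` token, and the function name are all assumptions.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token


def noise_row(number, tokens, t, rng):
    """Corrupt one (number, text) pair at noise level t in [0, 1].

    Assumes the number is already standardized (zero mean, unit variance).
    """
    # Continuous part: variance-preserving Gaussian corruption.
    # At t = 0 the value is untouched; at t = 1 it is pure noise.
    noisy_number = math.sqrt(1.0 - t) * number + math.sqrt(t) * rng.gauss(0.0, 1.0)

    # Discrete part: each token is independently replaced by MASK
    # with probability t.
    noisy_tokens = [MASK if rng.random() < t else tok for tok in tokens]
    return noisy_number, noisy_tokens


if __name__ == "__main__":
    rng = random.Random(0)
    print(noise_row(0.5, ["mild", "chest", "pain"], 0.5, rng))
```

A denoising model would then be trained to recover the clean number and the masked tokens jointly. That training loop is omitted here.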

AI · Neutral · Hugging Face Blog · Dec 16 · 4/10

Introducing the Synthetic Data Generator - Build Datasets with Natural Language

Judging by its title, the article introduces a Synthetic Data Generator tool for building datasets from natural-language instructions. No article body was available to summarize.