#data-synthesis News & Analysis

14 articles tagged with #data-synthesis. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 23 · 7/10

CLI-Universe: Towards Verifiable Task Synthesis Engine for Terminal Agents

Researchers introduce CLI-Universe, a systematic framework for generating high-quality training data for terminal agents by sampling task combinations across multiple capability dimensions and subjecting candidates to rigorous executable verification. Fine-tuning Qwen3-32B on the resulting CLI-Universe-6K dataset achieves state-of-the-art performance on Terminal-Bench 2.0 at 33.4%, outperforming much larger models and demonstrating that structured, high-fidelity data synthesis significantly improves AI agent efficiency.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

ANDES: Agent Native Data Evolving Synthesis Tool for Autonomous Instruction Alignment

Researchers introduce ANDES, a framework that enables AI agents to autonomously generate high-quality training data for LLM alignment by abstracting complex data-gathering tasks into a manageable agent skill. The system uses a self-evolving World Tree routing mechanism to help agents navigate noisy web environments and achieve state-of-the-art performance on alignment benchmarks despite computational constraints.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

Researchers introduce DOMINO, a framework that synthesizes domain-specific training data for large language models by learning from reference examples rather than explicit domain descriptions. The approach combines prompt tuning with contrastive learning to generate diverse, high-quality synthetic data without manual prompt engineering, improving coding task performance by up to 4.63%.

AI · Bullish · arXiv – CS AI · May 29 · 7/10

Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders

Researchers propose Feature Activation Coverage (FAC), a new metric for measuring data diversity in large language models using sparse autoencoders instead of traditional text-based metrics. The FAC Synthesis framework generates synthetic training data to fill feature gaps, demonstrating consistent improvements across multiple tasks and revealing transferable feature spaces across different model families.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

Text-Only Data Synthesis for Vision Language Model Training

Researchers propose a text-only framework for synthesizing vision-language model training data, eliminating the need for costly image-text pairs. The method generates two datasets (Unicorn-1.2M and Unicorn-471K-Instruction) through a three-stage process that converts text captions into synthetic visual representations, potentially reducing training costs and accelerating VLM development.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Time Series Reasoning via Process-Verifiable Thinking Data Synthesis and Scheduling for Tailored LLM Reasoning

Researchers introduce VeriTime, a framework that enhances large language models for time series analysis through synthetic data generation, intelligent data scheduling, and specialized reinforcement learning. The approach enables smaller models (3B-4B parameters) to match or exceed the reasoning capabilities of larger proprietary LLMs on time series tasks.

AI · Bearish · arXiv – CS AI · Mar 5 · 6/10

Preference Leakage: A Contamination Problem in LLM-as-a-judge

Researchers have identified 'preference leakage,' a contamination problem in LLM-as-a-judge systems where evaluator models show bias toward related data generator models. The study found this bias occurs when judge and generator LLMs share relationships like being the same model, having inheritance connections, or belonging to the same model family.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10

Efficient Agent Training for Computer Use

Researchers introduced PC Agent-E, an efficient AI agent training framework that achieves human-like computer use with minimal human demonstration data. Starting with just 312 human-annotated trajectories and augmenting them with Claude 3.7 Sonnet synthesis, the model achieved a 141% relative improvement and outperformed Claude 3.7 Sonnet by 10% on the WindowsAgentArena-V2 benchmark.

AI · Bullish · Google Research Blog · Aug 14 · 7/10

Beyond billion-parameter burdens: Unlocking data synthesis with a conditional generator

The article discusses advancements in generative AI focusing on data synthesis using conditional generators. This approach aims to address computational challenges associated with billion-parameter models by providing more efficient alternatives for data generation.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis

Researchers developed a novel framework for synthesizing training data that enables reasoning models to generate high-quality mathematical and reasoning problems by explicitly planning problem directions and adapting difficulty to solver capabilities. The approach achieved a 3.4% cumulative improvement across 10 benchmarks, demonstrating scalable alternatives to manual dataset curation.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

Lightweight GenAI for Network Traffic Synthesis: Fidelity, Augmentation, and Classification

Researchers developed lightweight generative AI models for creating synthetic network traffic data to address privacy concerns and data scarcity in network traffic classification. The models achieved up to 87% F1-score when classifiers were trained solely on synthetic data, with transformer-based approaches providing the best balance of accuracy and computational efficiency.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Researchers introduce ArtiAgent, an automated system that creates pairs of real and artifact-injected images to help AI models better detect and fix visual artifacts in generated content. The system uses three specialized agents to synthesize 100K annotated images, addressing the cost and scalability challenges of human-labeled artifact datasets.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

Generating Multi-Table Time Series EHR from Latent Space with Minimal Preprocessing

Researchers have developed RawMed, the first framework to generate synthetic multi-table time-series Electronic Health Records (EHR) that closely resemble raw medical data. The system addresses privacy concerns in healthcare data sharing while maintaining fidelity and utility, outperforming baseline models in validation tests.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10

MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs

Researchers introduce MMKG-RDS, a framework that uses multimodal knowledge graphs to synthesize high-quality training data for improving AI model reasoning abilities. Testing on Qwen3 models showed a 9.2% improvement in reasoning accuracy, with applications to constructing complex benchmarks that involve tables and formulas.