y0news

#data-synthesis News & Analysis

12 articles tagged with #data-synthesis. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Domain-Specific Data Synthesis for LLMs via Minimal Sufficient Representation Learning

Researchers introduce DOMINO, a framework that synthesizes domain-specific training data for large language models by learning from reference examples rather than explicit domain descriptions. The approach combines prompt tuning with contrastive learning to generate diverse, high-quality synthetic data without manual prompt engineering, improving coding task performance by up to 4.63%.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10

Less is Enough: Synthesizing Diverse Data in LLM Feature Space with Sparse Autoencoders

Researchers propose Feature Activation Coverage (FAC), a new metric for measuring data diversity in large language models using sparse autoencoders instead of traditional text-based metrics. The FAC Synthesis framework generates synthetic training data to fill feature gaps, demonstrating consistent improvements across multiple tasks and revealing transferable feature spaces across different model families.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

Text-Only Data Synthesis for Vision Language Model Training

Researchers propose a text-only framework for synthesizing vision-language model training data, eliminating the need for costly image-text pairs. The method generates two datasets (Unicorn-1.2M and Unicorn-471K-Instruction) through a three-stage process that converts text captions into synthetic visual representations, potentially reducing training costs and accelerating VLM development.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Time Series Reasoning via Process-Verifiable Thinking Data Synthesis and Scheduling for Tailored LLM Reasoning

Researchers introduce VeriTime, a framework that enhances large language models for time series analysis through synthetic data generation, intelligent data scheduling, and specialized reinforcement learning. The approach enables smaller models (3B-4B parameters) to match or exceed the reasoning capabilities of larger proprietary LLMs on time series tasks.

AI · Bearish · arXiv – CS AI · Mar 5 · 6/10

Preference Leakage: A Contamination Problem in LLM-as-a-judge

Researchers have identified 'preference leakage,' a contamination problem in LLM-as-a-judge systems where evaluator models show bias toward related data generator models. The study found this bias occurs when judge and generator LLMs share relationships like being the same model, having inheritance connections, or belonging to the same model family.

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10 · 2

Efficient Agent Training for Computer Use

Researchers introduced PC Agent-E, an efficient AI agent training framework that achieves human-like computer use with minimal human demonstration data. Starting with just 312 human-annotated trajectories and augmenting them with Claude 3.7 Sonnet synthesis, the model achieved a 141% relative improvement and outperformed Claude 3.7 Sonnet by 10% on the WindowsAgentArena-V2 benchmark.
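
For readers unfamiliar with the term, a relative improvement is the change in a score divided by the baseline score. The sketch below only illustrates that arithmetic; the function name and both scores are hypothetical and are not taken from the paper.

```python
def relative_improvement(baseline: float, new: float) -> float:
    """Return the change from baseline to new as a percentage of baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline * 100.0


# Hypothetical scores, chosen only to illustrate the arithmetic.
baseline_score = 10.0
trained_score = 24.1

print(f"{relative_improvement(baseline_score, trained_score):.0f}%")  # prints "141%"
```

Under this definition, a 141% relative improvement means the new score is about 2.41 times the baseline, not 141 percentage points higher.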

AI · Bullish · arXiv – CS AI · May 11 · 6/10

Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis

Researchers developed a novel framework for synthesizing training data that enables reasoning models to generate high-quality mathematical and reasoning problems by explicitly planning problem directions and adapting difficulty to solver capabilities. The approach achieved a 3.4% cumulative improvement across 10 benchmarks, demonstrating scalable alternatives to manual dataset curation.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

Lightweight GenAI for Network Traffic Synthesis: Fidelity, Augmentation, and Classification

Researchers developed lightweight generative AI models for creating synthetic network traffic data to address privacy concerns and data scarcity in network traffic classification. The models achieved up to 87% F1-score when classifiers were trained solely on synthetic data, with transformer-based approaches providing the best balance of accuracy and computational efficiency.

AI · Bullish · arXiv – CS AI · Mar 27 · 6/10

See and Fix the Flaws: Enabling VLMs and Diffusion Models to Comprehend Visual Artifacts via Agentic Data Synthesis

Researchers introduce ArtiAgent, an automated system that creates pairs of real and artifact-injected images to help AI models better detect and fix visual artifacts in generated content. The system uses three specialized agents to synthesize 100K annotated images, addressing the cost and scalability challenges of human-labeled artifact datasets.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4

Generating Multi-Table Time Series EHR from Latent Space with Minimal Preprocessing

Researchers have developed RawMed, the first framework to generate synthetic multi-table time-series Electronic Health Records (EHR) that closely resemble raw medical data. The system addresses privacy concerns in healthcare data sharing while maintaining fidelity and utility, outperforming baseline models in validation tests.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 14

MMKG-RDS: Reasoning Data Synthesis via Deep Mining of Multimodal Knowledge Graphs

Researchers introduce MMKG-RDS, a framework that uses multimodal knowledge graphs to synthesize high-quality training data for improving AI model reasoning abilities. Testing on Qwen3 models showed a 9.2% improvement in reasoning accuracy, with applications for complex benchmark construction involving tables and formulas.