AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers introduce DOMINO, a framework that synthesizes domain-specific training data for large language models by learning from reference examples rather than explicit domain descriptions. The approach combines prompt tuning with contrastive learning to generate diverse, high-quality synthetic data without manual prompt engineering, improving coding task performance by up to 4.63%.
AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠Researchers propose Feature Activation Coverage (FAC), a new metric for measuring data diversity in large language models using sparse autoencoders instead of traditional text-based metrics. The FAC Synthesis framework generates synthetic training data to fill feature gaps, demonstrating consistent improvements across multiple tasks and revealing transferable feature spaces across different model families.
AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers propose a text-only framework for synthesizing vision-language model training data, eliminating the need for costly image-text pairs. The method generates two datasets (Unicorn-1.2M and Unicorn-471K-Instruction) through a three-stage process that converts text captions into synthetic visual representations, potentially reducing training costs and accelerating VLM development.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers introduce VeriTime, a framework that enhances large language models for time series analysis through synthetic data generation, intelligent data scheduling, and specialized reinforcement learning. The approach enables smaller models (3B-4B parameters) to match or exceed the reasoning capabilities of larger proprietary LLMs on time series tasks.
AI · Bearish · arXiv – CS AI · Mar 5 · 6/10
🧠Researchers have identified 'preference leakage,' a contamination problem in LLM-as-a-judge systems where evaluator models show bias toward related data generator models. The study found this bias occurs when judge and generator LLMs share relationships like being the same model, having inheritance connections, or belonging to the same model family.
AI · Bullish · arXiv – CS AI · Mar 4 · 7/10
🧠Researchers introduced PC Agent-E, an efficient AI agent training framework that achieves human-like computer use with minimal human demonstration data. Starting with just 312 human-annotated trajectories and augmenting them with Claude 3.7 Sonnet synthesis, the model achieved a 141% relative improvement and outperformed Claude 3.7 Sonnet by 10% on the WindowsAgentArena-V2 benchmark.
AI · Bullish · Google Research Blog · Aug 14 · 7/10
🧠The article discusses advancements in generative AI focusing on data synthesis using conditional generators. This approach aims to address computational challenges associated with billion-parameter models by providing more efficient alternatives for data generation.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers developed a novel framework for synthesizing training data that enables reasoning models to generate high-quality mathematical and reasoning problems by explicitly planning problem directions and adapting difficulty to solver capabilities. The approach achieved a 3.4% cumulative improvement across 10 benchmarks, demonstrating scalable alternatives to manual dataset curation.
AI · Bullish · arXiv – CS AI · Mar 27 · 6/10
🧠Researchers developed lightweight generative AI models for creating synthetic network traffic data to address privacy concerns and data scarcity in network traffic classification. The models achieved up to 87% F1-score when classifiers were trained solely on synthetic data, with transformer-based approaches providing the best balance of accuracy and computational efficiency.
AI · Bullish · arXiv – CS AI · Mar 27 · 6/10
🧠Researchers introduce ArtiAgent, an automated system that creates pairs of real and artifact-injected images to help AI models better detect and fix visual artifacts in generated content. The system uses three specialized agents to synthesize 100K annotated images, addressing the cost and scalability challenges of human-labeled artifact datasets.
AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠Researchers have developed RawMed, the first framework to generate synthetic multi-table time-series Electronic Health Records (EHR) that closely resemble raw medical data. The system addresses privacy concerns in healthcare data sharing while maintaining fidelity and utility, outperforming baseline models in validation tests.
AI · Bullish · arXiv – CS AI · Mar 2 · 6/10
🧠Researchers introduce MMKG-RDS, a framework that uses multimodal knowledge graphs to synthesize high-quality training data for improving AI model reasoning abilities. Testing on Qwen3 models showed a 9.2% improvement in reasoning accuracy, with applications to constructing complex benchmarks that involve tables and formulas.