#data-augmentation News & Analysis

37 articles tagged with #data-augmentation. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 1d ago · 7/10

Anatomically-conditioned Latent Diffusion Model for Data-Efficient Few-Shot Cross-Domain 3D Glioma MRI Synthesis

Researchers propose ALDM, an anatomically-conditioned latent diffusion model that synthesizes 3D brain MRI scans from limited data to improve glioma classification across medical imaging centers. The framework achieves superior synthetic image quality and clinical classification performance with only 16 target images, addressing a critical challenge in medical AI where domain shifts and data scarcity limit model generalization.

AI · Bullish · arXiv – CS AI · 3d ago · 7/10

AI-Augmented Thyroid Scintigraphy for Robust Classification of Disease

Researchers demonstrate that Flow Matching generative models outperform Stable Diffusion and conventional augmentation techniques for classifying thyroid scintigraphy images, achieving an F1-score of 0.78 and an AUC of 0.95. The study validates that advanced AI-generated synthetic medical images can effectively address dataset limitations in diagnostic imaging tasks.

🧠 Stable Diffusion

AI · Bearish · arXiv – CS AI · Jun 19 · 7/10

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

Researchers conducted a rigorous controlled benchmark comparing quantum and classical generative models for augmenting brain MRI datasets. The study found no statistically significant performance difference between quantum and classical generators, and neither provided meaningful benefits over real-data-only training across various data scarcity scenarios.

AI · Bullish · arXiv – CS AI · Jun 9 · 7/10

Beyond Accuracy: Interpreting Topic Representation in Suicide Ideation Detection Models

Researchers demonstrate that suicide ideation detection models trained with topic-augmented datasets develop more interpretable internal representations of psychological risk factors. The study moves beyond standard accuracy metrics to examine how AI systems encode mental health concepts, revealing that augmentation clarifies underrepresented factors like immigration stress, family issues, and financial crisis.

AI · Bullish · arXiv – CS AI · Jun 5 · 7/10

Boosting Brain-to-Image Decoding with TRIBE v2 Data Augmentation

Researchers demonstrate that synthetic fMRI data generated by TRIBE v2, a large pretrained encoding model, can significantly improve brain-to-image decoding performance in low-data scenarios, achieving up to 68% improvement in accuracy. The findings suggest that foundation models trained on extensive neural data can enhance data efficiency for brain decoding tasks and enable zero-shot capabilities.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

Researchers introduce CauSim, a framework that enables large language models to improve causal reasoning by constructing increasingly complex executable causal simulators. The approach transforms causal reasoning from a scarce-data problem into a scalable supervised learning task, allowing LLMs to generate synthetic training data and demonstrate improved performance across different representations.

AI · Bullish · arXiv – CS AI · Apr 20 · 7/10

Large Language Models for Market Research: A Data-augmentation Approach

Researchers propose a novel statistical framework for integrating Large Language Model-generated data with real human data in conjoint analysis, addressing the bias gap between synthetic and authentic consumer responses. The approach delivers 24.9-79.8% cost and data savings while maintaining statistical robustness, validating that LLM data serves as a complement rather than substitute for human market research.

AI · Neutral · arXiv – CS AI · 3d ago · 6/10

Enhancing Protein Representation Learning via Manifold Restore Mixing

Researchers propose Manifold Restore Mixing (MRM), a novel data augmentation method that addresses structural degradation issues in protein representation learning by mixing hidden representations of original and augmented protein data. The approach combines manifold mixup techniques with a difficulty scheduler to generate training samples that preserve protein structure while introducing beneficial variations.

AI · Neutral · arXiv – CS AI · Jun 19 · 5/10

Leveraging systems' non-linearity to tackle the scarcity of data in the design of Intelligent Fault Diagnosis Systems

Researchers propose a novel Deep Transfer Learning approach for Intelligent Fault Diagnosis Systems that addresses data scarcity by leveraging system non-linearities and multi-excitation vibration analysis. The method combines pre-trained CNNs with a new data visualization and augmentation technique, validated on railway pantograph structures.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Improving Code-Switching ASR with Code-Mixing Guided Synthetic Speech

Researchers propose a code-mixing guided synthetic speech generation framework to improve automatic speech recognition (ASR) for multilingual code-switching scenarios. By optimizing synthetic data generation using the Code Mixing Index metric, the method demonstrates significant error rate reductions on Mandarin-English speech datasets, addressing a critical limitation in training data availability for code-switched ASR systems.

AI · Neutral · arXiv – CS AI · Jun 19 · 6/10

Improving End-to-End Speech Recognition for Dysarthric Speech through In-Domain Data Augmentation

Researchers developed data augmentation techniques to improve automatic speech recognition (ASR) for people with dysarthria by fine-tuning the Wav2Vec2 model. Using methods like speaking-rate modification, pitch modification, and formant modification tailored to different severity levels, the study achieved significant word error rate reductions across low, medium, and high severity dysarthric speech.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

++nnU-Net: Scaling nnU-Net with Prefix-Based Data Augmentation

Researchers introduce ++nnU-Net, an enhanced medical image segmentation framework that uses registration-based data augmentation to improve upon the standard nnU-Net architecture. The method demonstrates gains of up to 22% in Dice Similarity Coefficient across five 2D datasets, addressing the critical challenge of limited annotated medical imaging data.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

UPLOTS: A Unified Pretrained Language Model for Constrained Time-series Generation

UPLOTS is a unified pre-trained language model that generates constrained time-series data across multiple domains using a single transformer backbone guided by learned prompts. The framework addresses scalability limitations of existing domain-specific approaches by internalizing diverse temporal structures and enabling conditional generation with precise pattern control.

AI · Neutral · arXiv – CS AI · Jun 9 · 6/10

SNR-ST-Mix: Sample-specific Neighborhood Regression Mixup for Augmented Spatial Transcriptomics Imputation with Deep Neural Network

Researchers introduce SNR-ST-Mix, a data augmentation framework designed specifically for spatial transcriptomics that uses geometry-aware and expression-aware mixing to improve deep neural network performance. The method constrains data interpolation to k-nearest spatial neighbors and weights coefficients by expression similarity, enabling more biologically plausible synthetic training samples that enhance prediction accuracy without architectural changes.
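The two constraints named in this summary (interpolating only among k-nearest spatial neighbors, and weighting by expression similarity) can be illustrated with a small sketch. This is not the paper's implementation: the function name `knn_similarity_mix`, the softmax over cosine similarity, and the fixed mixing strength `alpha` are all assumptions chosen for illustration.

```python
import math

def knn_similarity_mix(coords, expr, index, k=2, alpha=0.5):
    """Mix spot `index` with its k nearest spatial neighbours (illustrative).

    coords: list of (x, y) spot positions; expr: list of expression vectors.
    Neighbour weights are a softmax over cosine similarity (an assumption).
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    # Geometry-aware step: only the k spatially closest spots are candidates.
    others = [j for j in range(len(coords)) if j != index]
    others.sort(key=lambda j: math.dist(coords[index], coords[j]))
    neighbours = others[:k]

    # Expression-aware step: more similar neighbours get larger weights.
    scores = [math.exp(cosine(expr[index], expr[j])) for j in neighbours]
    total = sum(scores)
    weights = [s / total for s in scores]

    # Convex combination of the anchor spot and its weighted neighbours.
    mixed = []
    for g in range(len(expr[index])):
        neighbour_value = sum(w * expr[j][g] for w, j in zip(weights, neighbours))
        mixed.append((1 - alpha) * expr[index][g] + alpha * neighbour_value)
    return neighbours, weights, mixed
```

Because distant spots never enter the candidate set, a spot with a very different expression profile far away cannot contaminate the synthetic sample, which is the intuition behind the "biologically plausible" claim.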

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Visual Prompting Meets Feature Reconstruction-Based Anomaly Detection with Dual-Teacher Supervision

Researchers introduce a novel anomaly detection framework combining visual prompting, unfrozen teacher models, and diffusion-based data augmentation to address real-world limitations in industrial inspection systems. The approach achieves a 3.5 percentage point improvement on the challenging AeBAD dataset, demonstrating practical applicability beyond controlled laboratory conditions.

AI · Bullish · arXiv – CS AI · Jun 9 · 6/10

Large Language Models for Imbalanced Classification: Diversity makes the difference

Researchers developed an LLM-based oversampling method to address imbalanced classification in machine learning, focusing on generating diverse synthetic minority samples. The approach outperforms existing methods like SMOTE by preserving categorical information and increasing diversity through new sampling and fine-tuning strategies.

AI · Neutral · arXiv – CS AI · Jun 8 · 6/10

Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios

Researchers demonstrate that synthetic MRI images generated by conditional neural networks can effectively augment training datasets for automated focal cortical dysplasia detection, reducing the need for manual annotations by approximately 20% while maintaining diagnostic sensitivity. Expert radiologists struggled to distinguish synthetic from real images, validating the realism of generated data, though real data remains superior when available.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Binary Gaussian Copula Synthesis: an LLM-powered data augmentation framework for early dialysis prediction in chronic kidney disease

Researchers developed Binary Gaussian Copula Synthesis (BGCS), an LLM-powered data augmentation method that addresses severe class imbalance in chronic kidney disease datasets to improve early dialysis prediction. Tested on 15,169 CKD patients, BGCS outperformed existing methods like SMOTE and CTGAN, achieving 78-87% minority-class recall and enabling deployment in interpretable clinical decision-support systems.

AI · Neutral · arXiv – CS AI · Jun 4 · 6/10

OA-CutMix: Correcting the Label Bias of CutMix

Researchers propose Object-Aware CutMix (OA-CutMix), a corrected version of the widely used CutMix data augmentation technique that fixes a fundamental labeling bias: patch area does not accurately reflect semantic contribution. The method uses segmentation masks to assign labels proportional to visible object area, consistently outperforming existing mixing methods across multiple architectures and datasets.
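To make the labeling bias concrete, the sketch below compares the standard CutMix label weight (fraction of pixels kept from image A) with a weight based on visible object pixels. It is only an illustration under assumptions: the function name `cutmix_label_weights`, the binary-mask input format, and the exact normalization are hypothetical and not taken from the paper.

```python
def cutmix_label_weights(mask_a, mask_b, box):
    """Return (area_weight_a, object_weight_a) for the label of image A.

    mask_a, mask_b: 2D lists of 0/1 object masks with identical shape.
    box: (top, left, bottom, right), half-open; this region of A is
    replaced by the same region of B.
    """
    top, left, bottom, right = box
    height, width = len(mask_a), len(mask_a[0])

    # Standard CutMix: A's weight is the fraction of pixels A still owns.
    patch_area = (bottom - top) * (right - left)
    area_weight_a = 1.0 - patch_area / (height * width)

    # Object-aware variant (assumed form): count object pixels still visible.
    visible_a = 0
    visible_b = 0
    for r in range(height):
        for c in range(width):
            inside = top <= r < bottom and left <= c < right
            if inside:
                visible_b += mask_b[r][c]
            else:
                visible_a += mask_a[r][c]
    total = visible_a + visible_b
    # Fall back to the area rule when no object pixel is visible at all.
    object_weight_a = visible_a / total if total else area_weight_a
    return area_weight_a, object_weight_a


# A 4x4 example: the pasted 2x2 patch hides half of A's object
# and shows all of B's object.
mask_a = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
mask_b = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
print(cutmix_label_weights(mask_a, mask_b, (0, 0, 2, 2)))  # (0.75, 0.5)
```

In this toy case the area rule still credits image A with 75% of the label even though both objects contribute equally many visible pixels, which is the kind of mismatch the summary describes.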

AI · Bullish · arXiv – CS AI · Jun 2 · 6/10

A Novel Data Augmentation Strategy for Robust Deep Learning Classification of Biomedical Time-Series Data: Application to ECG and EEG Analysis

Researchers propose a unified deep learning framework combining ResNet-based CNNs with attention mechanisms and novel data augmentation techniques for analyzing biomedical time-series signals like ECG and EEG. The approach achieves near-perfect accuracy (99.78-100%) on benchmark datasets while remaining lightweight enough for wearable deployment, addressing critical gaps in multi-signal analysis and class imbalance handling.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Synthetic Data from Cross-Domain Events for Large-Scale Recommendation Systems

Researchers introduce SCALR, a framework that generates synthetic user-item interaction data across recommendation system domains by leveraging observed events from source domains. The approach addresses data sparsity challenges in large-scale recommendation systems and demonstrates statistically significant improvements in industrial A/B testing.

AI · Bullish · arXiv – CS AI · Jun 1 · 6/10

Controllable Lung Nodule Synthesis via Histogram-Regularized Latent Diffusion Models

Researchers propose a histogram-regularized latent diffusion model that synthesizes realistic lung nodules in 3D CT volumes while accurately preserving intensity distributions characteristic of different nodule subtypes. The method addresses limitations in existing generative approaches by constraining lesion-level intensity profiles during synthesis, enabling improved data augmentation for cancer screening systems and better performance on underrepresented nodule types.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

GiPL: Generative augmented iterative Pseudo-Labeling for Cross-Domain Few-Shot Object Detection

Researchers propose GiPL, a two-branch machine learning framework that combines iterative pseudo-labeling with generative data augmentation to improve cross-domain few-shot object detection using vision-language models. The method demonstrates significant performance improvements on three benchmark datasets, addressing critical challenges in fine-tuning with limited target-domain samples.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

A Survey on Recent Advances in Conversational Data Generation

A comprehensive survey examines recent advances in synthetic dialogue data generation for conversational AI systems, addressing the challenge of data scarcity in training. The research categorizes methods across open-domain, task-oriented, and information-seeking dialogue systems, proposing a framework for generating multi-turn conversations at scale while maintaining quality standards.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Taming Data Challenges in ML-based Security Tasks Using Generative AI

Researchers propose using Generative AI to augment training datasets with synthetic data, improving machine learning security classifiers by up to 32.6% even with minimal training samples. The study evaluates six state-of-the-art GenAI methods across seven security tasks and introduces Nimai, a novel controlled data synthesis scheme, while identifying limitations in GenAI applicability to certain security domains.
