#data-efficiency News & Analysis

53 articles tagged with #data-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Residual Modeling for High-Fidelity Learned Compression of Scientific Data

Researchers present novel residual-centric compression methods (LBRC and NGLR) for scientific data that improve upon existing learned compression approaches by tailoring the encoding of reconstruction residuals to their structural properties. The techniques achieve 30-60% better compression ratios than Guaranteed Autoencoders and outperform the SZ compressor in high-fidelity regimes, addressing a critical bottleneck in compressing massive spatiotemporal datasets from scientific simulations.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study

Researchers propose reformulating infrastructure inspection as image difference classification (IDC) rather than traditional defect detection, leveraging digital twins to reduce annotated data requirements. A traffic sign case study demonstrates that instruction-based classifiers outperform encoder-based alternatives when comparing images against reference baselines, offering practical applications for low-resource infrastructure monitoring.

AI · Bullish · arXiv – CS AI · Jun 5 · 6/10

Towards Unified and Data-Efficient Prognostics and Health Management with Tabular Foundation Models

Researchers propose applying Tabular Foundation Models to industrial Prognostics and Health Management (PHM) tasks by converting time-series signals into tabular representations. The approach demonstrates superior performance across diagnostics and prognostics compared to sequence models and transformers, while achieving high data efficiency in low-data industrial settings.

AI · Neutral · arXiv – CS AI · Jun 5 · 6/10

Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization

Researchers propose EBiEOT, a novel semi-supervised learning framework that leverages both paired and unpaired data through likelihood maximization and inverse entropic optimal transport. The method demonstrates universal approximation properties and provides an end-to-end algorithm for learning conditional distributions, with potential applications in domain translation and other data-scarce scenarios.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

Smart Picks in the Dark: Towards Efficient RLVR for Reasoning via Tracing Metacognitive Pivots

Researchers propose PivotTrace, a data-efficient framework for training large reasoning models that selects unlabeled samples for annotation without prior supervision. The method achieves 29.3% annotation efficiency while converging 2.75x faster than standard supervised approaches by leveraging attention dynamics to quantify uncertainty.

AI · Bullish · arXiv – CS AI · Jun 4 · 6/10

GeoMin: Data-Efficient Semi-Supervised RLVR via Geometric Distribution Modeling

GeoMin, a new semi-supervised reinforcement learning method, advances LLM reasoning by using geometric distribution modeling to better utilize unlabeled data. The approach achieves 4.1% performance gains over existing methods and matches fully supervised models with only 10% of the annotation data, significantly improving data efficiency in AI training.

AI · Neutral · arXiv – CS AI · Jun 2 · 5/10

ChurnNet: A Optimized Modern AI for Churn Prediction

A new study comparing machine learning approaches for churn prediction finds that traditional methods like Random Forests and XGBoost outperform advanced deep learning models in predictive accuracy and efficiency while requiring fewer computational resources. The research challenges the assumption that complex temporal models are always superior for time-series classification tasks in customer retention.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Physics-Informed Deep Learning for Entropy Prediction in Heterogeneous Systems: Thermodynamic and Information-Theoretic Case Studies

Researchers introduce Physics-Informed Deep Learning (PIDL), a unified neural framework that enforces both differential equations and thermodynamic constraints simultaneously across different physical domains. The framework demonstrates exceptional data efficiency and zero Second Law violations in both thermodynamic and financial modeling applications.

AI · Bullish · arXiv – CS AI · May 29 · 6/10

Demystifying Data Organization for Enhanced LLM Training

Researchers have developed novel data organization methods (STR and SAW) for improving LLM training efficiency by strategically ordering training data using pre-computed sample-level scores. The study formalized four key guidelines—Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity—and validated their effectiveness across multiple model scales, offering practical improvements to training stability with minimal computational overhead.

AI · Bullish · arXiv – CS AI · May 28 · 6/10

Data-Efficient On-Policy Distillation for Automatic Speech Recognition

Researchers demonstrate that a 0.6B-parameter ASR model trained on 100k hours of speech can achieve performance competitive with larger models through teacher-guided on-policy distillation. The approach reduces audio data requirements by 99.5% compared to industry standards while closing the capability gap with 1.7B-parameter models.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models

Researchers introduce LUCoS, an unsupervised method for selecting training instances in tabular machine learning that uses latent embeddings rather than raw features. The approach significantly outperforms random selection across 67 datasets, addressing a critical cold-start problem in tabular foundation models like TabPFN.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

M$^3$: Reframing Training Measures for Discretized Physical Simulations

Researchers introduce M³ (Multi-scale Morton Measure), a framework that improves neural surrogate models for physical simulations by addressing training bias from discretized data sampling. The method achieves up to 4.7× error reduction in volumetric cases and maintains superior performance even with 90% data reduction, demonstrating that data distribution strategy significantly impacts operator learning efficiency.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Ace-Skill: Bootstrapping Multimodal Agents with Prioritized and Clustered Evolution

Researchers introduce Ace-Skill, a co-evolutionary framework that improves multimodal AI agents by optimizing both data sampling and knowledge organization. The system achieves 35% performance gains on tool-use benchmarks and enables smaller models to inherit capabilities from larger ones without additional training.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Probing the Impact of Scale on Data-Efficient, Generalist Transformer World Models for Atari

Researchers demonstrate that transformer-based world models exhibit distinct scaling behaviors across Atari environments, with joint multi-task training stabilizing performance gains. The study reveals that individual environments respond differently to model scaling, but unified training across 26 Atari games ensures consistent improvements regardless of inherent task complexity.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

PromptDx: Differentiable Prompt Tuning for Multimodal In-Context Alzheimer's Diagnosis

Researchers introduce PromptDx, a novel AI framework that combines differentiable prompt tuning with multimodal learning to diagnose Alzheimer's Disease using MRI and biomarker data. The method achieves competitive performance using only 1% of context samples compared to 30% in standard approaches, demonstrating significant data efficiency gains for medical imaging applications.

AI · Bullish · arXiv – CS AI · May 12 · 6/10

Semi-Supervised Neural Super-Resolution for Mesh-Based Simulations

Researchers introduce SuperMeshNet, a semi-supervised neural network framework that dramatically reduces the amount of expensive high-resolution training data needed for mesh-based simulations. By combining small paired datasets with abundant unpaired data through complementary learning, the system achieves superior accuracy while requiring 90% less supervised training data than fully supervised approaches.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

Researchers identify a critical flaw in robotic manipulation training: collecting diverse single-shot demonstrations paradoxically degrades performance due to estimation noise. Their proposed Anchor-Centric Adaptation (ACA) framework prioritizes repeated demonstrations of core tasks before expanding coverage, significantly improving robot reliability under strict data budgets.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Accelerated and data-efficient flow prediction in stirred tanks via physics-informed learning

Researchers demonstrate that physics-informed machine learning can predict fluid flows in industrial stirred tanks with significantly less training data than purely data-driven approaches. The study reveals diminishing returns in accuracy beyond moderate dataset sizes, with physics-based constraints proving most valuable in low-data regimes.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

Graph-Structured Hyperdimensional Computing for Data-Efficient and Explainable Process-Structure-Property Prediction

Researchers developed PSP-HDC, a graph-structured hyperdimensional computing framework for predicting material properties in 3D microstructure fabrication with sparse, heterogeneous data. The approach achieves 91% accuracy while providing inherent explainability—a critical advantage over conventional machine learning models that struggle with limited datasets and poor generalization.

AI · Neutral · arXiv – CS AI · May 11 · 6/10

TopoPrune: Robust Data Pruning via Unified Latent Space Topology

TopoPrune introduces a topology-based framework for data pruning that addresses instability issues in geometric methods by leveraging intrinsic data structure rather than extrinsic geometry. The approach combines manifold approximation with persistent homology to achieve high accuracy at extreme pruning rates (90%) while maintaining robustness across architectures and noise conditions.

AI · Neutral · arXiv – CS AI · Apr 20 · 6/10

Distribution Shift Alignment Helps LLMs Simulate Survey Response Distributions

Researchers introduced Distribution Shift Alignment (DSA), a novel fine-tuning method that enables large language models to more accurately simulate human survey responses by learning distribution patterns rather than memorizing training data. DSA outperforms existing methods across five public datasets and reduces required real-world data by 53-69%, offering significant cost savings for large-scale survey research.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 6

VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning

Researchers developed VisNec, a framework that identifies which training samples truly require visual reasoning for multimodal AI instruction tuning. The method achieves equivalent performance using only 15% of training data by filtering out visually redundant samples, potentially making multimodal AI training more efficient.

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10 · 14

From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

Researchers propose a data-efficient framework to convert generative Multimodal Large Language Models into universal embedding models without extensive pre-training. The method uses hierarchical embedding prompts and Self-aware Hard Negative Sampling to achieve competitive performance on embedding benchmarks using minimal training data.

AI · Bullish · arXiv – CS AI · Mar 2 · 7/10 · 14

Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

Researchers propose MetaAPO, a new framework for aligning large language models with human preferences that dynamically balances online and offline training data. The method uses a meta-learner to evaluate when on-policy sampling is beneficial, resulting in better performance while reducing online annotation costs by 42%.

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10 · 5

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

Researchers introduced NoRD (No Reasoning for Driving), a Vision-Language-Action model for autonomous driving that achieves competitive performance using 60% less training data and no reasoning annotations. The model incorporates the Dr. GRPO algorithm to overcome difficulty bias in reinforcement learning, demonstrating successful results on the Waymo and NAVSIM benchmarks.