y0news

#data-efficiency News & Analysis

17 articles tagged with #data-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠

Minimal Embodiment Enables Efficient Learning of Number Concepts in Robot

Researchers demonstrate that robots equipped with minimal embodied sensorimotor capabilities learn numerical concepts significantly faster than vision-only systems, achieving 96.8% counting accuracy with 10% of training data. The embodied neural network spontaneously develops biologically plausible number representations matching human cognitive development, suggesting embodiment acts as a structural learning prior rather than merely an information source.

AI · Bullish · arXiv – CS AI · 2d ago · 7/10
🧠

MM-LIMA: Less Is More for Alignment in Multi-Modal Datasets

MM-LIMA demonstrates that multimodal large language models can achieve superior performance using only 200 high-quality instruction examples, just 6% of the data used in comparable systems. Researchers developed quality metrics and an automated data selector to filter vision-language datasets, showing that strategic data curation outweighs raw dataset size in model alignment.
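The curation recipe can be sketched in a few lines. This is a toy illustration, not the paper's pipeline: `quality_score` here is a made-up length heuristic standing in for MM-LIMA's learned quality metrics.

```python
import random

def quality_score(example):
    """Toy quality metric: rewards longer, more detailed pairs.
    A stand-in for MM-LIMA's learned quality metrics."""
    instruction, response = example
    return min(len(instruction) / 50.0, 1.0) * min(len(response) / 200.0, 1.0)

def curate(dataset, k):
    """Keep only the k highest-quality examples, discarding the rest."""
    return sorted(dataset, key=quality_score, reverse=True)[:k]

# Mock pool of instruction-response pairs of varying length/detail.
random.seed(0)
pool = [("q" * random.randint(10, 80), "a" * random.randint(20, 300))
        for _ in range(3000)]

curated = curate(pool, 200)   # keep 200 examples, as in the summary
print(len(curated))
```

The point of the sketch is the shape of the method: score every example, then train on a small top-ranked slice instead of the full pool.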

AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠

Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

Researchers introduced Webscale-RL, a data pipeline that converts large-scale pre-training documents into 1.2 million diverse question-answer pairs for reinforcement learning training. The approach enables RL models to achieve pre-training-level performance with up to 100x fewer tokens, addressing a critical bottleneck in scaling RL data and potentially advancing more efficient language model development.

AI · Bullish · arXiv – CS AI · 6d ago · 7/10
🧠

Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Researchers demonstrate a data-efficient fine-tuning method for text-to-video diffusion models that enables new generative controls using sparse, low-quality synthetic data rather than expensive, photorealistic datasets. Counterintuitively, models trained on simple synthetic data outperform those trained on high-fidelity real data, supported by both empirical results and theoretical justification.

AI · Bullish · arXiv – CS AI · Mar 27 · 7/10
🧠

GoldiCLIP: The Goldilocks Approach for Balancing Explicit Supervision for Language-Image Pretraining

Researchers developed GoldiCLIP, a data-efficient vision-language model that achieves state-of-the-art performance using only 30 million images, 300x less data than leading methods. The framework combines three key innovations: text-conditioned self-distillation, VQA-integrated encoding, and uncertainty-based loss weighting, significantly improving image-text retrieval.

AI · Neutral · arXiv – CS AI · Mar 17 · 7/10
🧠

Uncertainty Quantification and Data Efficiency in AI: An Information-Theoretic Perspective

This research review examines methodologies for addressing AI systems' challenges with limited training data through uncertainty quantification and synthetic data augmentation. The paper presents formal approaches including Bayesian learning frameworks, information-theoretic bounds, and conformal prediction methods to improve AI performance in data-scarce environments like robotics and healthcare.

AI · Bullish · arXiv – CS AI · Mar 11 · 7/10
🧠

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

Researchers introduce ActiveUltraFeedback, an active learning pipeline that reduces the cost of training Large Language Models by using uncertainty estimates to identify the most informative responses for annotation. The system achieves comparable performance using only one-sixth of the annotated data compared to static baselines, potentially making LLM training more accessible for low-resource domains.
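The selection loop the summary describes can be sketched as follows; `predictive_entropy` and the mock preference pool are illustrative stand-ins, not the paper's actual pipeline.

```python
import math
import random

def predictive_entropy(probs):
    """Uncertainty of the preference model's prediction: high entropy
    means the model cannot tell which response is better."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_for_annotation(candidates, budget):
    """Send only the most uncertain candidates to (costly) annotators."""
    ranked = sorted(candidates,
                    key=lambda c: predictive_entropy(c["probs"]),
                    reverse=True)
    return ranked[:budget]

# Mock pool of 600 prompts, each with the preference model's probability
# that response A beats response B.
random.seed(1)
pool = []
for i in range(600):
    p = random.random()
    pool.append({"prompt": f"prompt-{i}", "probs": [p, 1 - p]})

# Annotate only one-sixth of the pool, mirroring the 6x saving above.
chosen = select_for_annotation(pool, len(pool) // 6)
print(len(chosen))
```

Uncertainty sampling concentrates the annotation budget on examples the current model finds ambiguous, which is where a new label changes the model most.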

🏢 Hugging Face
AI · Bullish · arXiv – CS AI · Mar 5 · 6/10
🧠

GIPO: Gaussian Importance Sampling Policy Optimization

GIPO (Gaussian Importance Sampling Policy Optimization) is a new reinforcement learning method that improves data efficiency for training multimodal AI agents. The approach uses Gaussian trust weights instead of hard clipping to better handle scarce or outdated training data, showing superior performance and stability across various experimental conditions.
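The summary does not give GIPO's exact weighting function, so the Gaussian form below (the importance ratio multiplied by a Gaussian bump centered at 1) is only a plausible illustration of "Gaussian trust weights", contrasted with PPO-style hard clipping.

```python
import math

def clipped_weight(ratio, eps=0.2):
    """PPO-style hard clipping: the weight is cut off outside
    [1 - eps, 1 + eps], so far-off-policy samples hit a hard wall."""
    return max(min(ratio, 1 + eps), 1 - eps)

def gaussian_weight(ratio, sigma=0.3):
    """Hypothetical Gaussian trust weight: smoothly down-weights samples
    whose importance ratio drifts from 1, instead of clipping them."""
    return ratio * math.exp(-((ratio - 1.0) ** 2) / (2 * sigma ** 2))

# Compare the two schemes as the policy drifts off-distribution.
for r in (0.5, 1.0, 1.5, 2.0):
    print(r, round(clipped_weight(r), 3), round(gaussian_weight(r), 3))
```

Note how the clipped weight saturates at a constant while the Gaussian weight decays continuously, which is the stability argument the summary alludes to for scarce or stale data.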

AI · Bullish · arXiv – CS AI · Mar 4 · 7/10
🧠

Tether: Autonomous Functional Play with Correspondence-Driven Trajectory Warping

Researchers introduce Tether, a breakthrough method enabling robots to perform autonomous functional play using minimal human demonstrations (≤10). The system generates over 1000 expert-level trajectories through continuous cycles of task execution and improvement, representing a significant advance in autonomous robotics learning.

AI · Bullish · Google Research Blog · Aug 7 · 7/10
🧠

Achieving 10,000x training data reduction with high-fidelity labels

Research demonstrates a breakthrough method for achieving 10,000x reduction in training data requirements while maintaining high-fidelity labels in machine learning systems. This advancement focuses on human-computer interaction and visualization techniques to optimize data efficiency in AI training processes.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10
🧠

VisNec: Measuring and Leveraging Visual Necessity for Multimodal Instruction Tuning

Researchers developed VisNec, a framework that identifies which training samples truly require visual reasoning for multimodal AI instruction tuning. The method achieves equivalent performance using only 15% of training data by filtering out visually redundant samples, potentially making multimodal AI training more efficient.
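The filtering idea can be sketched with a toy necessity score: how much the model's answer changes when the image is withheld. `mock_model` and the threshold are illustrative assumptions, not VisNec's actual metric.

```python
def visual_necessity(sample, model):
    """Toy proxy for visual necessity: the gap between the model's
    confidence with and without the image."""
    with_img = model(sample["question"], sample["image"])
    without_img = model(sample["question"], None)
    return abs(with_img - without_img)

def filter_dataset(dataset, model, keep_ratio=0.15):
    """Keep the 15% of samples that most require visual reasoning."""
    ranked = sorted(dataset, key=lambda s: visual_necessity(s, model),
                    reverse=True)
    return ranked[:max(1, int(len(ranked) * keep_ratio))]

def mock_model(question, image):
    """Mock model: the image only matters for genuinely visual questions."""
    if "color" in question:
        return 0.9 if image is not None else 0.3
    return 0.8

data = ([{"question": f"what color is item {i}?", "image": "img"}
         for i in range(15)]
        + [{"question": f"define term {i}", "image": "img"}
           for i in range(85)])

kept = filter_dataset(data, mock_model)
print(len(kept), all("color" in s["question"] for s in kept))
```

Only the samples whose answers actually depend on the image survive the filter, which is what lets training proceed on a small fraction of the data.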

AI · Bullish · arXiv – CS AI · Mar 2 · 6/10
🧠

From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

Researchers propose a data-efficient framework to convert generative Multimodal Large Language Models into universal embedding models without extensive pre-training. The method uses hierarchical embedding prompts and Self-aware Hard Negative Sampling to achieve competitive performance on embedding benchmarks using minimal training data.
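"Self-aware" hard negative sampling is not spelled out in the summary, so the sketch below is one toy reading: rank candidates by similarity to the anchor, but skip near-duplicates of the positive that are likely false negatives. All names and thresholds here are illustrative assumptions.

```python
def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def sample_hard_negatives(anchor, positive, pool, k=2, ceiling=0.95):
    """Take the k pool items most similar to the anchor, excluding the
    positive itself and anything suspiciously close to it (a guard
    against mislabeled false negatives)."""
    ranked = sorted(pool, key=lambda c: cosine(anchor, c), reverse=True)
    negatives = []
    for cand in ranked:
        if cand == positive or cosine(cand, positive) > ceiling:
            continue  # skip the positive and its near-duplicates
        negatives.append(cand)
        if len(negatives) == k:
            break
    return negatives

anchor = [1.0, 0.0]
positive = [0.9, 0.1]
pool = [positive, [0.95, 0.05], [0.6, 0.4], [0.0, 1.0]]
negs = sample_hard_negatives(anchor, positive, pool)
print(negs)
```

Hard negatives close to the anchor give the contrastive loss its signal; the "self-aware" filter keeps that pressure from being applied to items that are actually correct matches.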

AI · Bullish · arXiv – CS AI · Feb 27 · 6/10
🧠

NoRD: A Data-Efficient Vision-Language-Action Model that Drives without Reasoning

Researchers introduced NoRD (No Reasoning for Driving), a Vision-Language-Action model for autonomous driving that achieves competitive performance using 60% less training data and no reasoning annotations. The model incorporates the Dr. GRPO algorithm to overcome difficulty bias in reinforcement learning, demonstrating successful results on the Waymo and NAVSIM benchmarks.

AI · Bullish · arXiv – CS AI · Apr 6 · 5/10
🧠

Efficient Causal Graph Discovery Using Large Language Models

Researchers propose a new framework using Large Language Models for causal graph discovery that needs a number of queries linear, rather than quadratic, in the number of variables, making it more efficient for larger datasets. The method uses breadth-first search and can incorporate observational data, achieving state-of-the-art results on real-world causal graphs.
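The linear-vs-quadratic saving can be illustrated with a toy BFS that issues one "what does X directly cause?" query per node instead of one per pair. `TRUE_GRAPH` and `llm_children` are illustrative stubs; a real system would call an LLM.

```python
from collections import deque

# Toy ground-truth causal graph: node -> direct effects.
TRUE_GRAPH = {
    "smoking": ["tar", "anxiety"],
    "tar": ["cancer"],
    "anxiety": [],
    "cancer": ["fatigue"],
    "fatigue": [],
}

QUERIES = 0

def llm_children(node):
    """Stand-in for one LLM call: 'what does <node> directly cause?'
    One query per node gives linear, not quadratic, total cost."""
    global QUERIES
    QUERIES += 1
    return TRUE_GRAPH[node]

def discover(root):
    """BFS over the causal graph, issuing one query per visited node."""
    edges, seen, frontier = [], {root}, deque([root])
    while frontier:
        node = frontier.popleft()
        for child in llm_children(node):
            edges.append((node, child))
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return edges

edges = discover("smoking")
print(len(edges), QUERIES)
```

With 5 variables, a pairwise approach would need up to 20 directed-pair queries; the BFS recovers all 4 edges with one query per node.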

AI · Neutral · arXiv – CS AI · Mar 2 · 4/10
🧠

Less is more -- the Dispatcher/Executor principle for multi-task Reinforcement Learning

Researchers propose a dispatcher/executor principle for multi-task Reinforcement Learning that partitions controllers into task-understanding and device-specific components connected by a regularized communication channel. This structural approach aims to improve generalization and data efficiency as an alternative to simply scaling large neural networks with vast datasets.

AI · Neutral · arXiv – CS AI · Mar 2 · 4/10
🧠

Operator Learning with Domain Decomposition for Geometry Generalization in PDE Solving

Researchers propose a new framework called Operator Learning with Domain Decomposition to solve partial differential equations (PDEs) on arbitrary geometries using neural operators. The approach addresses data efficiency and geometry generalization challenges by breaking complex domains into smaller subdomains that can be solved locally and then combined into global solutions.
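The solve-locally-then-combine idea can be seen in a minimal alternating Schwarz sketch on the 1D Laplace equation, where the paper's neural-operator machinery is replaced by exact local solves; the grid sizes and overlap below are arbitrary illustration choices.

```python
def solve_local(a, b, n):
    """Exact solve of u'' = 0 on a subdomain with Dirichlet values a, b:
    the solution is the straight line between the boundary values."""
    return [a + (b - a) * i / (n - 1) for i in range(n)]

def schwarz(n=21, overlap=4, iters=20):
    """Alternating Schwarz on [0, 1] split into two overlapping pieces,
    for u'' = 0 with u(0) = 0, u(1) = 1 (global solution: u = x)."""
    u = [0.0] * n
    u[-1] = 1.0
    mid = n // 2
    lo_end = mid + overlap      # right edge of the left subdomain
    hi_start = mid - overlap    # left edge of the right subdomain
    for _ in range(iters):
        # Solve the left subdomain using the current value at its edge...
        u[:lo_end + 1] = solve_local(u[0], u[lo_end], lo_end + 1)
        # ...then the right subdomain using the freshly updated interior.
        u[hi_start:] = solve_local(u[hi_start], u[-1], n - hi_start)
    return u

u = schwarz()
# The iteration converges to the global solution u(x) = x.
print(max(abs(u[i] - i / 20) for i in range(21)) < 1e-6)
```

Each sweep shrinks the mismatch in the overlap region geometrically; swapping the exact local solve for a learned neural operator on each subdomain gives the structure the summary describes.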