#sample-efficiency News & Analysis

56 articles tagged with #sample-efficiency. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Jun 4 · 7/10

Making Expert Reasoning Learnable with Self-Distillation

Researchers propose Distribution Aligned Imitation Learning (DAIL), a self-distillation method that improves LLM reasoning by converting expert human solutions into computational training data. The technique achieves significant performance gains on frontier models using fewer than 1000 expert examples, addressing the challenge that expert solutions are typically written for humans rather than machines.

AI · Bullish · arXiv – CS AI · Jun 2 · 7/10

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

Researchers propose POPO (Group Prioritized Off-Policy Optimization), a new framework that improves reinforcement learning for large language model reasoning by efficiently reusing ineffective training samples without computational overhead. The method addresses a critical limitation in RLVR systems where many training samples yield zero-variance rewards, enabling faster model improvement across mathematics, planning, and visual reasoning tasks.

AI · Bullish · arXiv – CS AI · Jun 1 · 7/10

Scaling Multi-Agent Environment Co-Design with Diffusion Models

Researchers introduce Diffusion Co-Design (DiCoDe), a scalable framework that jointly optimizes agent policies and environment configurations using diffusion models with novel constraint-handling and knowledge-sharing mechanisms. The method achieves 39% higher rewards with 66% fewer simulations in warehouse automation, demonstrating significant advances in multi-agent system deployment across logistics, pathfinding, and renewable energy domains.

AI · Bullish · arXiv – CS AI · May 28 · 7/10

CORE: Contrastive Reflection Enables Rapid Improvements in Reasoning

Researchers introduce CORE (Contrastive Reflection), a non-parametric learning algorithm that improves language model reasoning by comparing successful and unsuccessful problem attempts to generate natural-language insights. The method achieves faster improvements than existing parametric and non-parametric approaches while requiring significantly fewer model rollouts and training samples, offering a more efficient and interpretable alternative to weight updates or prompt optimization.

AI · Bullish · arXiv – CS AI · May 12 · 7/10

Latent Personality Alignment: Improving Harmlessness Without Mentioning Harms

Researchers propose Latent Personality Alignment (LPA), a novel defense mechanism for large language models that achieves adversarial robustness by training on abstract personality traits rather than harmful examples. The method requires fewer than 100 training examples while matching the performance of traditional approaches using 150,000+ harmful prompts, and demonstrates superior generalization to unseen attack vectors.

AI · Bullish · arXiv – CS AI · May 11 · 7/10

Rubric-based On-policy Distillation

Researchers introduce ROPD, a rubric-based on-policy distillation framework that replaces teacher logits with structured semantic rubrics for model alignment. The approach achieves up to 10x better sample efficiency than logit-based methods while enabling distillation from proprietary black-box LLMs, addressing a critical scalability limitation in current model training.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

LANTERN: LLM-Augmented Neurosymbolic Transfer with Experience-Gated Reasoning Networks

Researchers introduce LANTERN, a framework that uses large language models to automatically generate task descriptions and intelligently aggregate knowledge from multiple source tasks for reinforcement learning. The system achieves 40-60% improvements in sample efficiency by adaptively weighting source policies based on task similarity and managing teacher-student knowledge transfer through uncertainty-aware gating.

AI · Bullish · arXiv – CS AI · May 9 · 7/10

Milestone-Guided Policy Learning for Long-Horizon Language Agents

Researchers introduce BEACON, a milestone-guided policy learning framework that significantly improves training efficiency for long-horizon language agents by solving credit misattribution and sample inefficiency problems. The approach achieves 92.9% success rates on complex tasks—nearly double previous benchmarks—while improving sample utilization from 23.7% to 82.0%.

AI · Bullish · arXiv – CS AI · May 7 · 7/10

Memory as a Markov Matrix: Sample Efficient Knowledge Expansion via Token-to-Dictionary Mapping

Researchers propose a novel framework that models language model memory as a Markov transition matrix, enabling efficient incorporation of new knowledge without catastrophic forgetting. The approach requires only linear sample complexity in the number of existing tokens and achieves zero forgetting through minimal parameter updates via an embedding-tuning algorithm.

AI · Bullish · arXiv – CS AI · Apr 14 · 7/10

Zero-shot World Models Are Developmentally Efficient Learners

Researchers introduce Zero-shot Visual World Models (ZWM), a computational framework inspired by how young children learn physical understanding from minimal data. The approach combines sparse prediction, causal inference, and compositional reasoning to achieve data-efficient learning, demonstrating that AI systems can match child development patterns while learning from single-child observational data.

AI · Bullish · arXiv – CS AI · Mar 9 · 7/10

COLD-Steer: Steering Large Language Models via In-Context One-step Learning Dynamics

Researchers introduce COLD-Steer, a training-free framework that enables efficient control of large language model behavior at inference time using just a few examples. The method approximates gradient descent effects without parameter updates, achieving 95% steering effectiveness while using 50 times fewer samples than existing approaches.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

Stabilizing Policy Gradients for Sample-Efficient Reinforcement Learning in LLM Reasoning

Researchers have developed Curvature-Aware Policy Optimization (CAPO), an algorithm that stabilizes policy-gradient training for large language models and improves sample efficiency by up to 30x. The method uses curvature information to identify and filter problematic training samples, intervening on fewer than 8% of tokens.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10 · 3

Model Predictive Adversarial Imitation Learning for Planning from Observation

Researchers have developed a new approach called Model Predictive Adversarial Imitation Learning that combines inverse reinforcement learning with model predictive control to enable AI agents to learn from incomplete human demonstrations. The method shows significant improvements in sample efficiency, generalization, and robustness compared to traditional imitation learning approaches.

AI · Neutral · arXiv – CS AI · Jun 25 · 6/10

Confidence Sequences for Online Statistical Model Checking of Markov Decision Processes

Researchers present new confidence sequence methods for statistical model checking of Markov decision processes in online settings, achieving 50x sample efficiency improvements over previous approaches. The work addresses the practical problem of obtaining meaningful guarantees when exact transition probabilities are unknown, with applications to cyber-physical and biological systems.

AI · Neutral · arXiv – CS AI · Jun 23 · 5/10

NASDAQ: Normalized Observation Space Dynamics-Augmented Q-Learning

Researchers propose NASDAQ, a reinforcement learning framework that addresses performance degradation in low-dimensional observation tasks by normalizing observation spaces before dynamics prediction. The method balances reconstruction losses across observation dimensions and achieves competitive performance with faster training than existing model-based and self-predictive RL approaches.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Select-to-Act: Hierarchical Reinforcement Learning via Adaptive Language Guidance

Researchers propose HRLLI, a hierarchical reinforcement learning framework that dynamically selects relevant natural-language instruction segments to guide agent decision-making at different stages of task execution. The approach outperforms existing instruction-conditioned RL baselines by treating language as adaptive, stage-specific guidance rather than static input, improving sample efficiency in complex environments.

AI · Neutral · arXiv – CS AI · Jun 23 · 5/10

UBP2: Uncertainty-Balanced Preference Planning for Efficient Preference-based Reinforcement Learning

Researchers introduce UBP2, a model-based reinforcement learning method that improves sample efficiency in preference-based learning by actively directing exploration through uncertainty quantification across reward, dynamics, and value functions. The approach achieves sublinear regret guarantees and demonstrates substantially higher sample efficiency than existing methods on benchmark tasks.

AI · Neutral · arXiv – CS AI · Jun 23 · 6/10

Backpropagating Through Simulation: Analytic Policy Gradients for Sample and Learning Efficient Differentiable Continuous Control

Researchers propose Analytic Policy Gradients (APG), a method that computes exact policy gradients through backpropagation in differentiable simulators, contrasting with model-free approaches like PPO that rely on sampled rewards. Testing across four continuous control tasks shows APG achieves superior sample efficiency, with a segmented backpropagation scheme that mitigates gradient degradation on long-horizon problems.

AI · Bullish · arXiv – CS AI · Jun 19 · 6/10

Which Pairs to Compare for LLM Post-Training?

Researchers present a theoretical framework for choosing which comparison pairs to label during preference-based post-training of large language models, showing that strategic pair selection can significantly improve sample efficiency. By formulating the problem as a sampling-design challenge with bounds on policy performance, the work provides practical guidance for allocating limited labeling budgets when training with methods such as Direct Preference Optimization.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

Diffusing to Coordinate: Efficient Online Multi-Agent Diffusion Policies

Researchers introduce OMAD, an online multi-agent reinforcement learning framework that integrates diffusion-based generative models for improved policy coordination. The method achieves 2.5-5x improvements in sample efficiency across benchmark tasks by using relaxed policy objectives and joint distributional value functions to enable effective exploration without requiring tractable likelihood calculations.

AI · Bullish · arXiv – CS AI · Jun 11 · 6/10

The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning

Researchers introduce MiDiGap, a machine learning approach using Gaussian Process Mixtures for robot policy learning that achieves state-of-the-art results in manipulation tasks from minimal demonstrations. The method learns complex behaviors like making coffee and opening doors in under a minute on CPU, with significant performance improvements over existing benchmarks and notable cross-embodiment transfer capabilities.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions

Researchers introduce CDQAC, an offline reinforcement learning algorithm that learns effective job scheduling policies from static, suboptimal datasets rather than requiring extensive online training interactions. The work shows that scheduling performance depends primarily on state-action coverage rather than trajectory quality, so the algorithm can learn effectively even from simple random heuristics while requiring only 1-5% of the original dataset size.

AI · Neutral · arXiv – CS AI · Jun 11 · 6/10

HERO: Hindsight-Enhanced Reflection from Environment Observations for Agentic Self-Distillation

Researchers introduce HERO, a self-distillation framework for reinforcement learning agents that uses environment observations as feedback to improve multi-turn decision-making. The method addresses credit assignment problems in sequential tasks by converting observations into actionable diagnoses, outperforming existing approaches on benchmark tasks with limited training data.

AI · Bullish · arXiv – CS AI · Jun 10 · 6/10

Structure from Reasoning, Numbers from Search: On-Premise Open LLMs as Structural Priors for Coupled MIMO Controller Tuning

Researchers demonstrate that on-premise open-source large language models can serve as structural priors for tuning complex industrial control systems, particularly excelling on strongly coupled MIMO systems where traditional methods fail. The approach achieves superior sample efficiency and interpretability compared to classical optimization, reaching near-optimal controller tuning in 18 evaluations versus hundreds needed by global optimizers.

AI · Neutral · arXiv – CS AI · Jun 10 · 6/10

TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Researchers introduce TRACE, a rollout budget allocation framework that improves reinforcement learning for large language models by optimizing reward signals across multi-turn agentic tasks. The method allocates computational resources to both initial prompts and intermediate decision points within conversations, demonstrating 2.8-point accuracy improvements on benchmarks at equivalent sampling costs.
