AI · Bullish · arXiv – CS AI · 3d ago · 7/10
🧠Researchers introduce CORE (Contrastive Reflection), a non-parametric learning algorithm that improves language model reasoning by comparing successful and unsuccessful problem attempts to generate natural-language insights. The method achieves faster improvements than existing parametric and non-parametric approaches while requiring significantly fewer model rollouts and training samples, offering a more efficient and interpretable alternative to weight updates or prompt optimization.
AI · Bullish · arXiv – CS AI · May 12 · 7/10
🧠Researchers propose Latent Personality Alignment (LPA), a novel defense mechanism for large language models that achieves adversarial robustness by training on abstract personality traits rather than harmful examples. The method requires fewer than 100 training examples while matching the performance of traditional approaches using 150,000+ harmful prompts, and demonstrates superior generalization to unseen attack vectors.
AI · Bullish · arXiv – CS AI · May 11 · 7/10
🧠Researchers introduce ROPD, a rubric-based on-policy distillation framework that replaces teacher logits with structured semantic rubrics for model alignment. The approach achieves up to 10x better sample efficiency than logit-based methods while enabling distillation from proprietary black-box LLMs, addressing a critical scalability limitation in current model training.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers introduce LANTERN, a framework that uses large language models to automatically generate task descriptions and intelligently aggregate knowledge from multiple source tasks for reinforcement learning. The system achieves 40-60% improvements in sample efficiency by adaptively weighting source policies based on task similarity and managing teacher-student knowledge transfer through uncertainty-aware gating.
AI · Bullish · arXiv – CS AI · May 9 · 7/10
🧠Researchers introduce BEACON, a milestone-guided policy learning framework that significantly improves training efficiency for long-horizon language agents by solving credit misattribution and sample inefficiency problems. The approach achieves 92.9% success rates on complex tasks—nearly double previous benchmarks—while improving sample utilization from 23.7% to 82.0%.
AI · Bullish · arXiv – CS AI · May 7 · 7/10
🧠Researchers propose a novel framework that models language model memory as a Markov transition matrix, enabling efficient incorporation of new knowledge without catastrophic forgetting. The approach requires only linear sample complexity in the number of existing tokens and achieves zero forgetting through minimal parameter updates via an embedding-tuning algorithm.
AI · Bullish · arXiv – CS AI · Apr 14 · 7/10
🧠Researchers introduce Zero-shot Visual World Models (ZWM), a computational framework inspired by how young children learn physical understanding from minimal data. The approach combines sparse prediction, causal inference, and compositional reasoning to achieve data-efficient learning, demonstrating that AI systems can match child development patterns while learning from single-child observational data.
AI · Bullish · arXiv – CS AI · Mar 9 · 7/10
🧠Researchers introduce COLD-Steer, a training-free framework that enables efficient control of large language model behavior at inference time using just a few examples. The method approximates gradient descent effects without parameter updates, achieving 95% steering effectiveness while using 50 times fewer samples than existing approaches.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers have developed a new approach called Model Predictive Adversarial Imitation Learning that combines inverse reinforcement learning with model predictive control to enable AI agents to learn from incomplete human demonstrations. The method shows significant improvements in sample efficiency, generalization, and robustness compared to traditional imitation learning approaches.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers have developed Curvature-Aware Policy Optimization (CAPO), a new algorithm that improves training stability and sample efficiency for Large Language Models by up to 30x. The method uses advanced mathematical optimization techniques to identify and filter problematic training samples, requiring intervention on fewer than 8% of tokens.
AI · Bullish · arXiv – CS AI · 2d ago · 6/10
🧠Researchers propose PACED-RL, a novel post-training framework that reinterprets the partition function in GFlowNet-based LLM training as a difficulty scheduler rather than merely a normalizer. By leveraging per-prompt accuracy signals, the method improves sample efficiency and maintains generation diversity while outperforming existing reward-maximizing approaches.
AI · Bullish · arXiv – CS AI · 4d ago · 6/10
🧠Researchers introduce R²VPO, a new reinforcement learning method that replaces hard clipping mechanisms with ratio-variance regularization to improve policy optimization. Tested across large language models and robotic control tasks, the approach improves both mathematical-reasoning performance and sample efficiency while maintaining stable learning.
AI · Neutral · arXiv – CS AI · May 12 · 6/10
🧠Researchers introduce Neuro-Symbolic Experience Replay (NSER), a framework that enhances reinforcement learning by combining Large Language Models with symbolic logic to transform passive memory buffers into active knowledge construction systems. The approach grounds LLM-generated behavioral rules into differentiable logic representations, enabling more efficient policy optimization across multiple benchmark environments.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce POETS, a novel framework that optimizes large language models through compute-efficient policy ensembles while quantifying uncertainty. By leveraging KL-regularized Thompson sampling and shared backbone architectures with independent LoRA branches, POETS achieves superior sample efficiency in scientific discovery tasks while reducing computational overhead compared to traditional ensemble methods.
AI · Neutral · arXiv – CS AI · May 11 · 6/10
🧠Researchers propose Direct Reasoning Optimization (DRO), a constrained reinforcement learning framework that improves LLM training on unverifiable tasks by combining token-level reasoning rewards with rubric-based feasibility gates. The approach demonstrates faster, more sample-efficient learning across scientific, medical, legal, and financial domains.
AI · Bullish · arXiv – CS AI · May 11 · 6/10
🧠Researchers introduce Goldilocks, a curriculum learning strategy that improves reinforcement learning efficiency for language models by having a teacher model dynamically select training questions of optimal difficulty for the student model. This addresses the sample inefficiency problem in sparse-reward RL training and demonstrates performance gains on reasoning tasks compared to standard approaches.
AI · Neutral · arXiv – CS AI · May 9 · 6/10
🧠Researchers propose VPSD-RL, a reinforcement learning framework that discovers value-preserving structures in continuous control tasks using Lie-group operators and diffusion models. The method improves data efficiency and robustness by identifying nonlinear transformations that preserve optimal value functions, addressing brittleness in RL systems under environmental variability.
AI · Neutral · arXiv – CS AI · May 7 · 6/10
🧠Researchers propose a generalization of differential temporal difference (TD) methods that extends their applicability from infinite-horizon to episodic reinforcement learning problems. By addressing how reward centering affects policy optimization in episodic settings, the work maintains theoretical guarantees while empirically demonstrating improved sample efficiency across multiple algorithms and environments.
AI · Neutral · arXiv – CS AI · May 1 · 6/10
🧠Researchers introduce EXPO, a reinforcement learning algorithm that trains expressive policies (like diffusion models) more efficiently by avoiding direct value optimization. The method uses a lightweight Gaussian policy to edit actions from a base policy, achieving 2-3x improvements in sample efficiency for both offline-to-online and fine-tuning scenarios.
AI · Neutral · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers introduce WOMBET, a framework that improves reinforcement learning efficiency in robotics by generating synthetic training data from a world model in source tasks and selectively transferring it to target tasks. The approach combines offline-to-online learning with uncertainty-aware planning to reduce data collection costs while maintaining robustness.
AI · Bullish · arXiv – CS AI · Apr 13 · 6/10
🧠Researchers propose a neuro-symbolic deep reinforcement learning approach that integrates logical rules and symbolic knowledge to improve sample efficiency and generalization in RL systems. The method transfers partial policies from simple tasks to complex ones, reducing training data requirements and improving performance in sparse-reward environments compared to existing baselines.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers propose MA-VLCM, a framework that uses pretrained vision-language models as centralized critics in multi-agent reinforcement learning instead of learning critics from scratch. This approach significantly improves sample efficiency and enables zero-shot generalization while producing compact policies suitable for resource-constrained robots.
AI · Bullish · arXiv – CS AI · Mar 17 · 6/10
🧠Researchers introduce XQC, a deep reinforcement learning algorithm that achieves state-of-the-art sample efficiency by optimizing the critic network's condition number through batch normalization, weight normalization, and distributional cross-entropy loss. The method outperforms existing approaches across 70 continuous control tasks while using fewer parameters.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce HiMAC, a hierarchical reinforcement learning framework that improves LLM agent performance on long-horizon tasks by separating macro-level planning from micro-level execution. The approach demonstrates state-of-the-art results across multiple environments, showing that structured hierarchy is more effective than simply scaling model size for complex agent tasks.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers propose EfficientZero-Multitask (EZ-M), a multi-task model-based reinforcement learning algorithm that scales the number of tasks rather than samples per task for robotics training. The approach achieves state-of-the-art performance on HumanoidBench with significantly higher sample efficiency by leveraging shared world models across diverse tasks.