511 articles tagged with #reinforcement-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers have developed a new approach called Model Predictive Adversarial Imitation Learning that combines inverse reinforcement learning with model predictive control to enable AI agents to learn from incomplete human demonstrations. The method shows significant improvements in sample efficiency, generalization, and robustness compared to traditional imitation learning approaches.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce Self-Harmony, a new test-time reinforcement learning framework that improves AI model accuracy by having models solve problems and rephrase questions simultaneously. The method uses harmonic mean aggregation instead of majority voting to select stable answers, achieving state-of-the-art results across 28 of 30 reasoning benchmarks without requiring human supervision.
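The aggregation step can be sketched in a few lines (a minimal illustration of harmonic-mean selection over two question views; the function name and two-list setup are assumptions, not the paper's actual interface): each candidate answer is scored by the harmonic mean of its empirical frequency under the original and the rephrased question, so an answer absent from either view scores zero and only answers stable across phrasings survive.

```python
from collections import Counter
from statistics import harmonic_mean

def select_answer(original_samples, rephrased_samples):
    """Pick the answer most stable across both question views, scored by
    the harmonic mean of its empirical frequency under each view."""
    freq_a = Counter(original_samples)
    freq_b = Counter(rephrased_samples)
    candidates = set(freq_a) | set(freq_b)

    def score(ans):
        p = freq_a[ans] / len(original_samples)
        q = freq_b[ans] / len(rephrased_samples)
        if p == 0 or q == 0:
            return 0.0  # harmonic mean vanishes if either view never yields it
        return harmonic_mean([p, q])

    return max(candidates, key=score)
```

Unlike majority voting over the pooled samples, an answer that dominates one phrasing but never appears under the other is rejected.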
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce RLP (Reinforcement Learning Pretraining), a new training method that incorporates reinforcement learning exploration into the pretraining phase rather than only post-training. The approach treats chain-of-thought reasoning as exploratory actions and achieved 19% performance improvements on math and science benchmarks across different model architectures.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce LongWriter-Zero, a reinforcement learning approach that enables large language models to generate ultra-long, high-quality text without relying on synthetic training data. The 32B parameter model outperforms traditional supervised fine-tuning methods and even surpasses larger 100B+ models on long-form writing benchmarks.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce SPIRAL, a self-play reinforcement learning framework that enables language models to develop reasoning capabilities by playing zero-sum games against themselves without human supervision. The system improves performance by up to 10% across 8 reasoning benchmarks on multiple model families including Qwen and Llama.
AI · Bullish · arXiv – CS AI · Mar 3 · 7/10
🧠Researchers introduce ExGRPO, a new framework that improves AI reasoning by reusing and prioritizing valuable training experiences based on correctness and entropy. The method shows consistent performance gains of 3.5 to 7.6 points over standard approaches across multiple model sizes while providing more stable training.
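The idea of ranking experiences by correctness and entropy can be roughly illustrated as a prioritized replay buffer (the scoring formula below is an assumption for the sketch, not ExGRPO's actual criterion): mid-difficulty rollouts with confident, low-entropy reasoning are kept and replayed first.

```python
class ExperienceBuffer:
    """Toy replay buffer that keeps the highest-priority experiences."""

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self.items = []  # list of (priority, experience)

    @staticmethod
    def priority(correct_ratio, entropy):
        # Assumed heuristic: mid-difficulty questions (accuracy near 0.5)
        # and confident (low-entropy) trajectories are most valuable.
        difficulty_bonus = 1.0 - 2.0 * abs(correct_ratio - 0.5)
        confidence_bonus = 1.0 / (1.0 + entropy)
        return difficulty_bonus * confidence_bonus

    def add(self, experience, correct_ratio, entropy):
        self.items.append((self.priority(correct_ratio, entropy), experience))
        self.items.sort(key=lambda t: t[0], reverse=True)
        del self.items[self.capacity:]  # evict lowest-priority entries

    def sample(self, k):
        return [exp for _, exp in self.items[:k]]
```

An always-correct rollout scores zero here, since it carries no learning signal under this heuristic.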
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers propose Supervised Reinforcement Learning (SRL), a new training framework that helps small-scale language models solve complex multi-step reasoning problems by generating internal reasoning monologues and providing step-wise rewards. SRL outperforms traditional Supervised Fine-Tuning and Reinforcement Learning approaches, enabling smaller models to tackle previously unlearnable problems.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers propose Generalized On-Policy Distillation (G-OPD), a new AI training framework that improves upon standard on-policy distillation by introducing flexible reference models and reward scaling factors. The method, particularly ExOPD with reward extrapolation, enables smaller student models to surpass their teacher's performance in math reasoning and code generation tasks.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers achieved breakthrough sample complexity improvements for offline reinforcement learning algorithms using f-divergence regularization, particularly for contextual bandits. The study demonstrates optimal O(ε⁻¹) sample complexity under single-policy concentrability conditions, significantly improving upon existing bounds.
AI · Neutral · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers developed a new theoretical framework for accelerated risk-averse policy evaluation in partially observable Markov decision processes (POMDPs) using Conditional Value-at-Risk (CVaR) bounds. The method enables safe elimination of suboptimal actions while maintaining computational guarantees, achieving substantial speedups in autonomous agent decision-making under uncertainty.
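For context, the CVaR risk measure itself is simple to state: the mean of the worst α-fraction of outcomes rather than the plain expectation. A minimal empirical sketch (the paper's bounds treat this measure analytically, not by sampling):

```python
def cvar(returns, alpha=0.1):
    """Conditional Value-at-Risk: the average of the worst alpha
    fraction of sampled returns (a risk-averse objective)."""
    k = max(1, int(len(returns) * alpha))  # size of the worst tail
    worst = sorted(returns)[:k]
    return sum(worst) / len(worst)
```

Optimizing CVaR instead of the mean makes an agent sensitive to rare catastrophic outcomes that an expectation would average away.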
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers propose Decision MetaMamba (DMM), a new AI model architecture that improves offline reinforcement learning by addressing information loss issues in Mamba-based models. The solution uses a dense layer-based sequence mixer and modified positional structure to achieve state-of-the-art performance with fewer parameters.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers developed Hyper Diffusion Planner (HDP), a diffusion model-based framework for end-to-end autonomous driving that achieved 10x performance improvement over base models in real-world testing. The study conducted comprehensive evaluation across 200 km of real-world driving scenarios, demonstrating diffusion models can effectively scale to complex autonomous driving tasks when properly designed and trained.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers propose EGPO, a new framework that improves large reasoning models by incorporating uncertainty awareness into reinforcement learning training. The approach addresses the "uncertainty-reward mismatch" where current training methods treat high and low-confidence solutions equally, preventing models from developing better reasoning capabilities.
AI · Bullish · arXiv – CS AI · Feb 27 · 7/10
🧠Researchers have introduced AIQI (Universal AI with Q-Induction), the first model-free artificial intelligence agent proven to be asymptotically optimal in general reinforcement learning. Unlike previous optimal agents like AIXI that rely on environment models, AIQI performs universal induction over distributional action-value functions, significantly expanding the diversity of known universal agents.
AI · Bullish · Synced Review · Jun 16 · 7/10
🧠MIT researchers have developed SEAL, a new framework that enables large language models to self-edit and update their own weights through reinforcement learning. This represents a significant advancement toward creating AI systems capable of autonomous self-improvement.
AI · Bullish · OpenAI News · May 16 · 7/10
🧠OpenAI has released Codex, a cloud-based coding agent powered by codex-1, which is an optimized version of OpenAI o3 specifically designed for software engineering tasks. The system was trained using reinforcement learning on real-world coding environments to generate human-like code that follows instructions precisely and iteratively tests until achieving passing results.
AI · Bullish · Synced Review · Apr 24 · 7/10
🧠Kwai AI has developed SRPO, a new reinforcement learning framework that reduces LLM post-training steps by 90% while achieving performance comparable to DeepSeek-R1 in mathematics and coding tasks. The two-stage approach with history resampling addresses efficiency limitations in existing GRPO methods.
AI · Bullish · OpenAI News · Sep 12 · 7/10
🧠OpenAI has introduced o1, a new large language model that uses reinforcement learning to perform complex reasoning tasks. The model generates an internal chain of thought before providing responses, representing a significant advancement in AI reasoning capabilities.
AI · Bullish · OpenAI News · Sep 4 · 7/10
🧠Researchers have successfully applied reinforcement learning from human feedback (RLHF) to improve language model summarization capabilities. This approach uses human preferences to guide the training process, resulting in models that produce higher quality summaries aligned with human expectations.
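The reward-model fitting at the heart of RLHF can be sketched with a Bradley-Terry style pairwise loss (a minimal stdlib sketch; the actual work trains a neural reward model, with the scalar rewards below standing in for its outputs):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry style loss on one human preference pair:
    minimized when the human-preferred summary scores higher."""
    return -math.log(sigmoid(reward_chosen - reward_rejected))
```

The trained reward model then supplies the scalar reward that a policy-gradient method optimizes against.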
AI · Bullish · OpenAI News · Oct 15 · 7/10
🧠OpenAI has trained neural networks to solve a Rubik's Cube using a human-like robot hand, with training conducted entirely in simulation using reinforcement learning and a new technique called Automatic Domain Randomization (ADR). The system demonstrates unprecedented dexterity and can handle unexpected physical situations it never encountered during training, showing reinforcement learning's potential for complex real-world applications.
AI · Bullish · OpenAI News · Mar 4 · 7/10
🧠Neural MMO is a new massively multiagent game environment designed for training reinforcement learning agents. The platform enables a large, variable number of agents to interact in persistent, open-ended tasks, promoting better exploration and niche formation among AI agents.
AI · Bullish · OpenAI News · Oct 31 · 7/10
🧠OpenAI researchers have developed Random Network Distillation (RND), a reinforcement learning method that uses prediction-based rewards to encourage AI agents to explore environments through curiosity. This breakthrough represents the first time an AI system has exceeded average human performance on the notoriously difficult Atari game Montezuma's Revenge.
AI · Bullish · OpenAI News · Aug 11 · 7/10
🧠OpenAI has developed an AI bot that defeats world-class professional players in 1v1 Dota 2 matches under standard tournament rules. The bot learned entirely through self-play without using imitation learning or tree search techniques, representing a significant advancement in AI systems handling complex, real-world scenarios.
AI · Bullish · OpenAI News · Jul 20 · 7/10
🧠OpenAI has released Proximal Policy Optimization (PPO), a new class of reinforcement learning algorithms that matches or exceeds state-of-the-art performance while being significantly simpler to implement and tune. PPO has been adopted as OpenAI's default reinforcement learning algorithm due to its ease of use and strong performance characteristics.
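At PPO's core is the clipped surrogate objective; a one-function, per-sample sketch (with `ratio` being the new policy's probability of the action divided by the old policy's):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: the pessimistic minimum of the raw and
    clipped policy-ratio terms, removing the incentive to push the new
    policy far from the old one in a single update."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

In practice this objective is averaged over a minibatch of trajectories and maximized with a standard first-order optimizer, which is what makes PPO so simple to implement and tune.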
AI · Bullish · OpenAI News · Mar 24 · 7/10
🧠Researchers have found that evolution strategies (ES), a decades-old optimization technique, can match the performance of modern reinforcement learning methods on standard benchmarks like Atari and MuJoCo. This discovery suggests ES could serve as a more scalable alternative to traditional RL approaches while avoiding many of RL's practical limitations.
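The core ES update fits in a few lines; here is a toy sketch maximizing a quadratic reward (the hyperparameters and mean-reward baseline are illustrative choices, not the paper's exact setup): reward-weighted random perturbations of the parameters estimate a gradient without ever backpropagating through the reward function.

```python
import random
from statistics import mean

def evolution_strategies(f, theta, sigma=0.1, lr=0.03, pop=50, iters=150):
    """Vanilla ES: estimate the gradient of expected reward from random
    parameter perturbations alone -- no backpropagation through f."""
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    theta = list(theta)
    n = len(theta)
    for _ in range(iters):
        samples = []
        for _ in range(pop):
            eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
            samples.append((f([t + sigma * e for t, e in zip(theta, eps)]), eps))
        baseline = mean(r for r, _ in samples)  # variance reduction
        for i in range(n):
            g = sum((r - baseline) * eps[i] for r, eps in samples) / (pop * sigma)
            theta[i] += lr * g
    return theta

# Toy reward peaked at (3, -2); ES climbs it from reward queries alone.
reward = lambda p: -((p[0] - 3.0) ** 2 + (p[1] + 2.0) ** 2)
best = evolution_strategies(reward, [0.0, 0.0])
```

Because each perturbation is evaluated independently, the population loop parallelizes trivially across workers, which is the scalability advantage the article highlights.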