y0news

#reinforcement-learning News & Analysis

511 articles tagged with #reinforcement-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

CRAFT-GUI: Curriculum-Reinforced Agent For GUI Tasks

Researchers introduce CRAFT-GUI, a curriculum learning framework that uses reinforcement learning to improve AI agents' performance in graphical user interface tasks. The method addresses difficulty variation across GUI tasks and provides more nuanced feedback, achieving 5.6% improvement on Android Control benchmarks and 10.3% on internal benchmarks.
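The core idea of curriculum-reinforced training is task selection matched to the agent's current ability. The weighting rule below is illustrative only (the task names, difficulty scores, and competence scale are invented for the example, not taken from CRAFT-GUI):

```python
import random

def curriculum_sample(tasks, competence):
    """Toy curriculum sampler: weight each GUI task by how close its
    difficulty sits to the agent's current competence, so training
    focuses on tasks that are neither trivial nor hopeless."""
    weights = [1.0 / (1.0 + abs(t["difficulty"] - competence)) for t in tasks]
    return random.choices(tasks, weights=weights, k=1)[0]

# Hypothetical GUI tasks with difficulty in [0, 1]
tasks = [
    {"name": "tap_button", "difficulty": 0.1},
    {"name": "fill_form",  "difficulty": 0.5},
    {"name": "multi_app",  "difficulty": 0.9},
]
picked = curriculum_sample(tasks, competence=0.5)
```

As competence rises, the sampler's mass shifts toward harder tasks, which is the mechanism a curriculum framework uses to handle difficulty variation.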

AI · Bullish · arXiv – CS AI · Mar 16 · 6/10

Information-Consistent Language Model Recommendations through Group Relative Policy Optimization

Researchers developed a new reinforcement learning framework using Group Relative Policy Optimization (GRPO) to make Large Language Models provide consistent recommendations across semantically equivalent prompts. The method addresses a critical enterprise need for reliable AI systems in business domains like finance and customer support, where inconsistent responses undermine trust and compliance.
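GRPO's defining step (independent of this paper's consistency objective) is critic-free advantage estimation: each sampled response's reward is normalized against the mean and spread of its group. A minimal sketch:

```python
import statistics

def grpo_advantages(rewards):
    """Group Relative Policy Optimization estimates advantages by
    normalizing each sampled response's reward against its group's
    mean and standard deviation, avoiding a learned value critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Rewards for four responses sampled from the same prompt
advs = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

For consistency training, the group would contain responses to semantically equivalent paraphrases of one prompt, so divergent answers receive negative advantage.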

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10

Adaptive RAN Slicing Control via Reward-Free Self-Finetuning Agents

Researchers propose a novel self-finetuning framework for AI agents that enables continuous learning without handcrafted rewards, demonstrating superior performance in dynamic Radio Access Network slicing tasks. The approach uses bi-perspective reflection to generate autonomous feedback and distill long-term experiences into model parameters, outperforming traditional reinforcement learning methods.

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10

CLIPO: Contrastive Learning in Policy Optimization Generalizes RLVR

Researchers introduce CLIPO (Contrastive Learning in Policy Optimization), a new method that improves upon Reinforcement Learning with Verifiable Rewards (RLVR) for training Large Language Models. CLIPO addresses hallucination and answer-copying issues by incorporating contrastive learning to better capture correct reasoning patterns across multiple solution paths.

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10

Towards Cold-Start Drafting and Continual Refining: A Value-Driven Memory Approach with Application to NPU Kernel Synthesis

Researchers introduce EvoKernel, a self-evolving AI framework that addresses the 'Data Wall' problem in deploying Large Language Models for kernel synthesis on data-scarce hardware platforms like NPUs. The system uses memory-based reinforcement learning to improve correctness from 11% to 83% and achieves 3.60x speedup through iterative refinement.
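A value-driven memory of the kind described keeps only the most promising drafts to seed the next refinement round. The sketch below is a generic top-k memory, with invented draft names and scores; EvoKernel's actual retrieval and scoring rules are not reproduced here:

```python
def update_memory(memory, candidate, capacity=3):
    """Value-driven memory sketch: retain only the highest-scoring
    kernel drafts. Retrieved entries seed the next refinement round,
    which is how correctness can improve iteratively when training
    data for the target hardware is scarce."""
    memory.append(candidate)
    memory.sort(key=lambda e: e["score"], reverse=True)
    del memory[capacity:]  # evict low-value drafts beyond capacity
    return memory

mem = []
for kernel, score in [("draft_v1", 0.11), ("draft_v2", 0.40),
                      ("draft_v3", 0.25), ("draft_v4", 0.83)]:
    update_memory(mem, {"kernel": kernel, "score": score})
```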

AI · Bullish · arXiv – CS AI · Mar 12 · 6/10

Dynamics-Predictive Sampling for Active RL Finetuning of Large Reasoning Models

Researchers propose Dynamics-Predictive Sampling (DPS), a new method that improves reinforcement learning finetuning of large language models by predicting which training prompts will be most informative, without running expensive rollouts. The technique models each prompt's learning progress as a dynamical system and uses Bayesian inference to select better training data, reducing computational overhead while achieving superior reasoning performance.
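The paper's exact dynamical model is not given here, but the flavor of rollout-free Bayesian prompt selection can be sketched with a toy posterior per prompt. Everything below (Beta posteriors, the variance criterion, the prompt names) is an illustrative assumption, not DPS itself:

```python
def select_prompts(stats, k):
    """Toy Bayesian prompt selection: each prompt keeps a Beta(a, b)
    posterior over its solve rate, and the prompts with the highest
    posterior variance (most uncertain, hence plausibly most
    informative) are chosen for the next RL batch -- no rollouts run."""
    def beta_var(a, b):
        return a * b / ((a + b) ** 2 * (a + b + 1))
    ranked = sorted(stats, key=lambda p: beta_var(p["a"], p["b"]), reverse=True)
    return [p["id"] for p in ranked[:k]]

prompts = [
    {"id": "easy", "a": 9, "b": 1},   # almost always solved
    {"id": "hard", "a": 1, "b": 9},   # almost never solved
    {"id": "edge", "a": 5, "b": 5},   # ~50% solve rate, most uncertain
]
picked = select_prompts(prompts, 1)
```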

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

Evaluate-as-Action: Self-Evaluated Process Rewards for Retrieval-Augmented Agents

Researchers propose EvalAct, a new method that improves retrieval-augmented AI agents by converting retrieval quality assessment into explicit actions and using Process-Calibrated Advantage Rescaling (PCAR) for optimization. The approach shows superior performance on multi-step reasoning tasks across seven open-domain QA benchmarks by providing better process-level feedback signals.

AI · Bullish · arXiv – CS AI · Mar 11 · 6/10

Social-R1: Towards Human-like Social Reasoning in LLMs

Researchers introduce Social-R1, a reinforcement learning framework that enhances social reasoning in large language models by training on adversarial examples. The approach enables a 4B parameter model to outperform larger models across eight benchmarks by supervising the entire reasoning process rather than just outcomes.

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

Boosting deep Reinforcement Learning using pretraining with Logical Options

Researchers propose Hybrid Hierarchical RL (H²RL), a new framework that combines symbolic logic with deep reinforcement learning to address misalignment issues in AI agents. The method uses logical option-based pretraining to improve long-horizon decision-making and prevent agents from over-exploiting short-term rewards.

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

PRISM: Personalized Refinement of Imitation Skills for Manipulation via Human Instructions

PRISM is a new AI method that combines imitation learning and reinforcement learning to train robotic manipulation systems using human instructions and feedback. The approach allows generic robotic policies to be refined for specific tasks through natural language descriptions and human corrections, improving performance in pick-and-place tasks while reducing computational requirements.

AI · Neutral · arXiv – CS AI · Mar 9 · 6/10

When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On

Researchers propose Implicit Error Counting (IEC), a new reinforcement learning approach for training AI models in domains where multiple valid outputs exist and traditional rubric-based evaluation fails. The method focuses on counting what responses get wrong rather than what they get right, with validation shown in virtual try-on applications where it outperforms existing rubric-based methods.
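The reward principle described, scoring by the absence of enumerated errors rather than rubric compliance, reduces to a very small function. The defect labels below are invented for illustration; the paper's actual error taxonomy and counting mechanism are not shown:

```python
def error_count_reward(response_errors):
    """Error-enumeration-style reward: instead of grading what a
    response gets right against a rubric, count its enumerated
    defects and reward their absence. This suits open-ended domains
    (like virtual try-on) where many outputs are equally valid."""
    return -len(response_errors)

# Two hypothetical try-on outputs with enumerated defects
r_a = error_count_reward(["sleeve clipping", "texture seam"])
r_b = error_count_reward(["sleeve clipping"])
```

Because every valid output scores the same maximum (zero errors), the reward never penalizes stylistic diversity the way a fixed rubric can.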

AI · Bullish · arXiv – CS AI · Mar 9 · 6/10

CARE What Fails: Contrastive Anchored-REflection for Verifiable Multimodal

Researchers introduce CARE (Contrastive Anchored REflection), a new AI training framework that improves multimodal reasoning by learning from failures rather than just successes. The method achieved 4.6 point accuracy improvements on visual-reasoning benchmarks and reached state-of-the-art results on MathVista and MMMU-Pro when tested on Qwen models.

AI · Bullish · arXiv – CS AI · Mar 6 · 6/10

Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction

Researchers introduce RLSTA (Reinforcement Learning with Single-Turn Anchors), a new training method that addresses 'contextual inertia' - a problem where AI models fail to integrate new information in multi-turn conversations. The approach uses single-turn reasoning capabilities as anchors to improve multi-turn interaction performance across domains.

AI · Bullish · arXiv – CS AI · Mar 6 · 6/10

CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models

Researchers propose CTRL-RAG, a new reinforcement learning framework that improves large language models' ability to generate accurate, context-faithful responses in Retrieval-Augmented Generation systems. The method uses a Contrastive Likelihood Reward mechanism that optimizes the difference between responses with and without supporting evidence, addressing issues of hallucination and model collapse in existing RAG systems.
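A contrastive likelihood reward of the kind described compares the same answer's likelihood with and without the retrieved evidence in context. The specific probabilities below are made up for illustration; how CTRL-RAG normalizes or clips this signal is not shown:

```python
import math

def contrastive_likelihood_reward(logp_with_evidence, logp_without):
    """Contrastive likelihood reward sketch: score a response by how
    much the retrieved evidence raises its log-likelihood, so answers
    grounded in context outrank answers the model would have produced
    from parametric memory anyway."""
    return logp_with_evidence - logp_without

# Hypothetical per-response log-likelihoods from the policy model
grounded = contrastive_likelihood_reward(math.log(0.6), math.log(0.1))
parametric = contrastive_likelihood_reward(math.log(0.3), math.log(0.3))
```

An answer whose likelihood is unchanged by the evidence earns zero reward, which is exactly the pressure against context-free hallucination.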

AI · Neutral · arXiv – CS AI · Mar 5 · 5/10

IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning

Researchers propose Imaginary Planning Distillation (IPD), a novel framework that enhances offline reinforcement learning by incorporating planning into sequential policy models. IPD uses world models and Model Predictive Control to generate optimal rollouts, training Transformer-based policies that significantly outperform existing methods on D4RL benchmarks.
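The planning half of this pipeline, rolling candidate action sequences through a learned world model and keeping the best imagined return, can be sketched with random-shooting MPC. The toy world model and sampling scheme are assumptions for illustration; IPD's distillation into a Transformer policy is not shown:

```python
import random

def mpc_plan(world_model, state, horizon, n_candidates):
    """Toy random-shooting Model Predictive Control for 'imaginary
    planning': sample candidate action sequences, roll each through
    the world model, and return the sequence with the best imagined
    return. Such rollouts are what a policy would then distill."""
    best_seq, best_ret = None, float("-inf")
    for _ in range(n_candidates):
        seq = [random.uniform(-1, 1) for _ in range(horizon)]
        s, ret = state, 0.0
        for a in seq:
            s, r = world_model(s, a)
            ret += r
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq

# Hypothetical world model: reward for driving the state toward 0
def toy_model(s, a):
    s_next = s + a
    return s_next, -abs(s_next)

plan = mpc_plan(toy_model, state=2.0, horizon=3, n_candidates=64)
```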

AI · Neutral · arXiv – CS AI · Mar 4 · 5/10

QFlowNet: Fast, Diverse, and Efficient Unitary Synthesis with Generative Flow Networks

Researchers introduce QFlowNet, a novel framework combining Generative Flow Networks with Transformers to solve quantum circuit compilation challenges. The approach achieves 99.7% success rate on 3-qubit benchmarks while generating diverse, efficient quantum gate sequences, addressing key limitations of traditional reinforcement learning methods in quantum computing.
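What distinguishes a GFlowNet from standard RL here is the training objective: the widely used trajectory-balance loss makes the sampler draw sequences with probability proportional to reward, yielding diverse high-quality circuits rather than a single optimum. A minimal sketch of that general objective (the toy numbers are illustrative, not from QFlowNet):

```python
import math

def trajectory_balance_loss(log_z, log_pf, log_pb, reward):
    """GFlowNet trajectory-balance objective:
    (log Z + sum log P_F - log R(x) - sum log P_B)^2.
    Driving it to zero enforces flow balance, so terminal states
    (e.g. gate sequences) are sampled in proportion to reward."""
    return (log_z + sum(log_pf) - math.log(reward) - sum(log_pb)) ** 2

# Toy trajectory: when log Z + sum log P_F equals
# log R + sum log P_B, balance holds and the loss is zero.
loss = trajectory_balance_loss(
    log_z=1.0,
    log_pf=[-0.5, -0.5],
    log_pb=[0.0, 0.0],
    reward=1.0,  # log R = 0
)
```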

AI · Neutral · arXiv – CS AI · Mar 4 · 5/10

VideoTemp-o3: Harmonizing Temporal Grounding and Video Understanding in Agentic Thinking-with-Videos

Researchers introduce VideoTemp-o3, a new AI framework that improves long-video understanding by intelligently identifying relevant video segments and performing targeted analysis. The system addresses key limitations in current video AI models including weak localization and rigid workflows through unified masking mechanisms and reinforcement learning rewards.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10

MVR: Multi-view Video Reward Shaping for Reinforcement Learning

Researchers introduce Multi-View Video Reward Shaping (MVR), a new reinforcement learning framework that uses multi-viewpoint video analysis and vision-language models to improve reward design for complex AI tasks. The system addresses limitations of single-image approaches by analyzing dynamic motions across multiple camera angles, showing improved performance on humanoid locomotion and manipulation tasks.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

Researchers propose Likelihood-Free Policy Optimization (LFPO), a new framework for improving Diffusion Large Language Models by bypassing likelihood computation issues that plague existing methods. LFPO uses geometric velocity rectification to optimize denoising logits directly, achieving better performance on code and reasoning tasks while reducing inference time by 20%.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

GAC: Stabilizing Asynchronous RL Training for LLMs via Gradient Alignment Control

Researchers propose GAC (Gradient Alignment Control), a new method to stabilize asynchronous reinforcement learning training for large language models. The technique addresses training instability issues that arise when scaling RL to modern AI workloads by regulating gradient alignment and preventing overshooting.
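One simple way to regulate gradient alignment in asynchronous training is to keep a stale worker's gradient only to the extent it agrees with the current policy's direction. The projection rule below is an illustrative sketch, not GAC's exact mechanism:

```python
def align_gradient(stale_grad, fresh_grad):
    """Gradient-alignment sketch for asynchronous RL: a delayed
    worker's gradient is kept only where it has positive inner
    product with the current gradient direction; misaligned updates
    are dropped to prevent overshooting."""
    dot = sum(a * b for a, b in zip(stale_grad, fresh_grad))
    norm_sq = sum(b * b for b in fresh_grad)
    if dot <= 0:              # stale update opposes the current direction
        return [0.0] * len(stale_grad)
    scale = dot / norm_sq     # project onto the fresh direction
    return [scale * b for b in fresh_grad]

aligned = align_gradient([1.0, -2.0], [1.0, 0.0])   # partial agreement kept
dropped = align_gradient([-1.0, 0.0], [1.0, 0.0])   # full opposition zeroed
```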

AI · Neutral · arXiv – CS AI · Mar 3 · 7/10

Align and Filter: Improving Performance in Asynchronous On-Policy RL

Researchers propose a new method called total Variation-based Advantage aligned Constrained policy Optimization to address policy lag issues in distributed reinforcement learning systems. The approach aims to improve performance when scaling on-policy learning algorithms by mitigating the mismatch between behavior and learning policies during high-frequency updates.

AI · Bullish · arXiv – CS AI · Mar 3 · 7/10

LOGIGEN: Logic-Driven Generation of Verifiable Agentic Tasks

Researchers introduce LOGIGEN, a logic-driven framework that synthesizes verifiable training data for autonomous AI agents operating in complex environments. The system uses a triple-agent orchestration approach and achieved a 79.5% success rate on benchmarks, nearly doubling the base model's 40.7% performance.

Page 12 of 21