500 articles tagged with #reinforcement-learning. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AINeutralarXiv โ CS AI ยท Apr 106/10
๐ง Researchers introduce REVEAL, an explainable AI framework for detecting AI-generated images through forensic evidence chains and expert-grounded reinforcement learning. The approach addresses the growing challenge of distinguishing synthetic images from authentic ones while providing transparent, verifiable reasoning for detection decisions.
AIBullisharXiv โ CS AI ยท Apr 106/10
๐ง Researchers introduce RePro, a novel post-training technique that optimizes large language models' reasoning processes by framing chain-of-thought as gradient descent and using process-level rewards to reduce overthinking. The method demonstrates consistent performance improvements across mathematics, science, and coding benchmarks while mitigating inefficient reasoning behaviors in LLMs.
AINeutralarXiv โ CS AI ยท Apr 106/10
๐ง Q-Probe introduces a novel agentic framework for scaling image quality assessment to high-resolution images by addressing limitations in existing reinforcement learning approaches. The research presents Vista-Bench, a new benchmark for fine-grained degradation analysis, and demonstrates state-of-the-art performance across multiple resolution scales through context-aware probing mechanisms.
AINeutralarXiv โ CS AI ยท Apr 76/10
๐ง Research reveals that adaptive reward mechanisms in AI-guided satellite scheduling systems actually hurt performance, with static reward weights achieving 342.1 Mbps versus dynamic weights at only 103.3 Mbps. The study found that fine-tuned LLMs performed poorly due to weight oscillation issues, while simpler MLP models achieved superior results of 357.9 Mbps.
AIBullisharXiv โ CS AI ยท Apr 76/10
๐ง Researchers introduce PRAISE, a new framework that improves training efficiency for AI agents performing complex search tasks like multi-hop question answering. The method addresses key limitations in current reinforcement learning approaches by reusing partial search trajectories and providing intermediate rewards rather than only final answer feedback.
AINeutralarXiv โ CS AI ยท Apr 76/10
๐ง Researchers developed a four-layer pedagogical safety framework for AI tutoring systems and introduced the Reward Hacking Severity Index (RHSI) to measure misalignment between proxy rewards and genuine learning. Their study of 18,000 simulated interactions found that engagement-optimized AI agents systematically selected high-engagement actions with no learning benefits, requiring constrained architectures to reduce reward hacking.
AIBullisharXiv โ CS AI ยท Apr 76/10
๐ง Researchers have developed Memory Intelligence Agent (MIA), a new AI framework that improves deep research agents through a Manager-Planner-Executor architecture with advanced memory systems. The framework enables continuous learning during inference and demonstrates superior performance across eleven benchmarks through enhanced cooperation between parametric and non-parametric memory systems.
AINeutralarXiv โ CS AI ยท Apr 76/10
๐ง Researchers developed an AI framework using reinforcement learning to automatically discover failure modes in vision-language models without human intervention. The system trains a questioner agent that generates adaptive queries to expose weaknesses, successfully identifying 36 novel failure modes across various VLM combinations.
AIBullisharXiv โ CS AI ยท Apr 66/10
๐ง Researchers introduce PROGRS, a new framework that improves mathematical reasoning in large language models by using process reward models while maintaining focus on outcome correctness. The approach addresses issues with current reinforcement learning methods that can reward fluent but incorrect reasoning steps.
AIBullisharXiv โ CS AI ยท Apr 66/10
๐ง Researchers have developed OPRIDE, a new algorithm for offline preference-based reinforcement learning that significantly improves query efficiency. The algorithm addresses key challenges of inefficient exploration and overoptimization through principled exploration strategies and discount scheduling mechanisms.
AIBullisharXiv โ CS AI ยท Apr 66/10
๐ง Researchers propose Rubrics to Tokens (RTT), a novel reinforcement learning framework that improves Large Language Model alignment by bridging response-level and token-level rewards. The method addresses reward sparsity and ambiguity issues in instruction-following tasks through fine-grained credit assignment and demonstrates superior performance across different models.
AIBullisharXiv โ CS AI ยท Apr 66/10
๐ง Researchers introduce R2-Write, a new AI framework that improves large language models' performance on open-ended writing tasks by incorporating explicit reflection and revision patterns. The study reveals that existing reasoning models show limited gains in creative writing compared to mathematical tasks, prompting the development of an automated system with writer-judge interactions and process reward mechanisms.
AIBullisharXiv โ CS AI ยท Mar 276/10
๐ง Researchers introduce RC2, a reinforcement learning framework that improves multimodal AI reasoning by enforcing consistency between visual and textual representations. The system uses cycle-consistent training to resolve internal conflicts between modalities, achieving up to 7.6 point improvements in reasoning accuracy without requiring additional labeled data.
AIBullisharXiv โ CS AI ยท Mar 276/10
๐ง Researchers propose X-OPD, a Cross-Modal On-Policy Distillation framework to improve Speech Large Language Models by aligning them with text-based counterparts. The method uses token-level feedback from teacher models to bridge performance gaps in end-to-end speech systems while preserving inherent capabilities.
AIBullisharXiv โ CS AI ยท Mar 276/10
๐ง Researchers developed a multi-answer reinforcement learning approach that trains language models to generate multiple plausible answers with confidence estimates in a single forward pass, rather than collapsing to one dominant answer. The method shows improved diversity and accuracy across question-answering, medical diagnosis, and coding benchmarks while being more computationally efficient than existing approaches.
AIBullisharXiv โ CS AI ยท Mar 276/10
๐ง Researchers introduce TimeLens, a family of multimodal large language models optimized for video temporal grounding that outperforms existing open-source models and even surpasses proprietary models like GPT-5 and Gemini-2.5-Flash. The work addresses critical data quality issues in existing benchmarks and introduces improved training datasets and algorithmic design principles.
๐ง GPT-5๐ง Gemini
AIBullisharXiv โ CS AI ยท Mar 266/10
๐ง Researchers propose Preference-based Constrained Reinforcement Learning (PbCRL), a new approach for safe AI decision-making that learns safety constraints from human preferences rather than requiring extensive expert demonstrations. The method addresses limitations in existing Bradley-Terry models by introducing a dead zone mechanism and Signal-to-Noise Ratio loss to better capture asymmetric safety costs and improve constraint alignment.
AIBullisharXiv โ CS AI ยท Mar 266/10
๐ง Researchers propose Dual Guidance Optimization (DGO), a new framework that improves large language model training by combining external experience banks with internal knowledge to better mimic human learning patterns. The approach shows consistent improvements over existing reinforcement learning methods for reasoning tasks.
AIBullisharXiv โ CS AI ยท Mar 266/10
๐ง Researchers developed a scalable multi-turn synthetic data generation pipeline using reinforcement learning to improve large language models' code generation capabilities. The approach uses teacher models to create structured difficulty progressions and curriculum-based training, showing consistent improvements in code generation across Llama3.1-8B and Qwen models.
๐ง Llama
AINeutralarXiv โ CS AI ยท Mar 266/10
๐ง Researchers introduce GeoSketch, a neural-symbolic AI framework that solves geometric problems through dynamic visual manipulation, including drawing auxiliary lines and applying transformations. The system combines perception, symbolic reasoning, and interactive sketch actions, achieving superior performance on geometric problem-solving benchmarks compared to static image processing methods.
AIBullisharXiv โ CS AI ยท Mar 266/10
๐ง Researchers introduce Generative Adversarial Reasoner, a new training framework that improves LLM mathematical reasoning by using adversarial reinforcement learning between a reasoner and discriminator model. The method achieved significant performance gains on mathematical benchmarks, improving DeepSeek models by 7-10 percentage points on AIME24 tests.
๐ง Llama
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers developed REFINE-DP, a hierarchical framework that combines diffusion policies with reinforcement learning to enable humanoid robots to perform complex loco-manipulation tasks. The system achieves over 90% success rate in simulation and demonstrates smooth autonomous execution in real-world environments for tasks like door traversal and object transport.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers developed a resource-efficient framework for compressing large language models using knowledge distillation and chain-of-thought reinforcement learning. The method successfully compressed Qwen 3B to 0.5B while retaining 70-95% of performance across English, Spanish, and coding tasks, making AI models more suitable for resource-constrained deployments.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers introduce SmoothVLA, a new reinforcement learning framework that improves robot control by optimizing both task performance and motion smoothness. The system addresses the trade-off between stability and exploration in Vision-Language-Action models, achieving 13.8% better smoothness than standard RL methods.
AIBullisharXiv โ CS AI ยท Mar 176/10
๐ง Researchers have developed a new audio-visual speech enhancement framework that uses Large Language Models and reinforcement learning to improve speech quality. The method outperforms existing baselines by using LLM-generated natural language feedback as rewards for model training, providing more interpretable optimization compared to traditional scalar metrics.