35 articles tagged with #policy-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.
AI Bullish · arXiv · CS AI · 2d ago · 7/10
🧠 Researchers introduce Inverse-RPO, a methodology for deriving prior-based tree policies in Monte Carlo Tree Search from first principles, and apply it to create variance-aware UCT algorithms that outperform PUCT without additional computational overhead. This advances the theoretical foundation of MCTS used in reinforcement learning systems like AlphaZero.
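The variance-aware UCT idea above can be sketched as a UCB1-Tuned-style selection rule, where the exploration bonus shrinks for arms whose reward variance is already known to be low. This is an illustrative stand-in; the paper's Inverse-RPO-derived rule may differ in the exact bonus term.

```python
import math

def variance_aware_uct(node_stats, c=1.0):
    """Pick the index of the child maximizing a variance-aware UCT score.

    node_stats: list of (visit_count, mean_reward, reward_variance) tuples.
    Hypothetical sketch in the spirit of UCB1-Tuned, not the paper's rule.
    """
    total = sum(n for n, _, _ in node_stats)
    log_t = math.log(max(total, 1))

    def score(stats):
        n, mean, var = stats
        if n == 0:
            return float("inf")  # always expand unvisited children first
        # Variance-aware exploration bonus: cap the per-arm variance term
        # at 1/4 (the UCB1 worst case for bounded rewards).
        bonus = math.sqrt((log_t / n) * min(0.25, var + math.sqrt(2 * log_t / n)))
        return mean + c * bonus

    return max(range(len(node_stats)), key=lambda i: score(node_stats[i]))
```

An unvisited child always wins selection, so the tree is fully expanded before the variance term starts to matter.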
AI Bullish · arXiv · CS AI · 3d ago · 7/10
🧠 Researchers introduce SafeAdapt, a novel framework for updating reinforcement learning policies while maintaining provable safety guarantees across changing environments. The approach uses a 'Rashomon set' to identify safe parameter regions and projects policy updates onto this certified space, addressing the critical challenge of deploying RL agents in safety-critical applications where dynamics and objectives evolve over time.
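The core mechanic — take an unconstrained policy update, then project the result back into a certified-safe parameter region — can be sketched with a simple axis-aligned box standing in for the Rashomon set. The real certified region in the paper is more complex; all names here are illustrative.

```python
def project_update(theta, grad, lr, lower, upper):
    """Gradient step followed by Euclidean projection onto a safe box.

    theta, grad: lists of parameter values and gradients.
    lower, upper: scalar bounds of the (toy) certified safe region.
    """
    stepped = [t + lr * g for t, g in zip(theta, grad)]    # unconstrained update
    return [min(max(x, lower), upper) for x in stepped]    # project onto the box
```

For a box region the projection is just coordinate-wise clipping; a genuine Rashomon set would need a more general constrained-optimization step.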
AI Bullish · arXiv · CS AI · Apr 6 · 7/10
🧠 Researchers propose Sign-Certified Policy Optimization (SignCert-PO) to address reward hacking in reinforcement learning from human feedback (RLHF), a critical problem where AI models exploit learned reward systems rather than improving actual performance. The lightweight approach down-weights non-robust responses during policy optimization and showed improved win rates on summarization and instruction-following benchmarks.
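One plausible reading of "sign-certified" down-weighting is to trust a response only when an ensemble of reward models agrees on the sign of its reward. The sketch below is a hypothetical interpretation, not the paper's actual certification procedure.

```python
def sign_certified_weights(ensemble_rewards):
    """Weight each response by how unanimously an ensemble of reward
    models agrees on the sign of its reward.

    ensemble_rewards: per-model reward lists, shape (models, responses).
    Returns a weight in [0, 1] per response: 1.0 = unanimous sign,
    0.0 = evenly split, so sign-unstable responses are down-weighted.
    """
    n_models = len(ensemble_rewards)
    n_resp = len(ensemble_rewards[0])
    weights = []
    for j in range(n_resp):
        signs = [1 if ensemble_rewards[i][j] > 0 else -1 for i in range(n_models)]
        weights.append(abs(sum(signs)) / n_models)
    return weights
```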
AI Bullish · arXiv · CS AI · Mar 16 · 7/10
🧠 Researchers introduce Guided Policy Optimization (GPO), a new reinforcement learning framework that addresses challenges in partially observable environments by co-training a guider with privileged information and a learner through imitation learning. The method demonstrates theoretical optimality comparable to direct RL and shows strong empirical performance across various tasks including continuous control and memory-based challenges.
AI Bullish · arXiv · CS AI · Mar 11 · 7/10
🧠 Researchers introduce Stepwise Guided Policy Optimization (SGPO), a new framework that improves upon Group Relative Policy Optimization (GRPO) by learning from incorrect reasoning responses in large language model training. SGPO addresses the limitation where GRPO fails to update policies when all responses in a group are incorrect, showing improved performance across multiple model sizes and reasoning benchmarks.
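The GRPO failure mode the summary describes is easy to see in the group-normalized advantage itself: when every response in a group gets the same (zero) reward, every advantage is zero and no gradient flows. A minimal sketch:

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as in GRPO: reward minus group mean,
    divided by group std. If all responses in the group are wrong
    (identical rewards), every advantage is exactly zero — the
    'no-signal' case that SGPO's stepwise guidance targets."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

With rewards `[0, 0, 0, 0]` the deviations are all zero before normalization, so the policy update vanishes regardless of how informative the individual (wrong) responses were.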
AI Bullish · arXiv · CS AI · Mar 5 · 6/10
🧠 GIPO (Gaussian Importance Sampling Policy Optimization) is a new reinforcement learning method that improves data efficiency for training multimodal AI agents. The approach uses Gaussian trust weights instead of hard clipping to better handle scarce or outdated training data, showing superior performance and stability across various experimental conditions.
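The contrast with hard clipping can be sketched directly: instead of truncating the importance ratio at fixed bounds (PPO-style), a Gaussian trust weight centred at ratio = 1 lets stale off-policy samples fade out smoothly. The `sigma` value is an assumed hyperparameter, not one from the paper.

```python
import math

def gaussian_trust_weight(ratio, sigma=0.3):
    """Smooth alternative to hard ratio clipping: weight a sample by a
    Gaussian in its importance ratio, peaking at 1.0 for on-policy data
    and decaying continuously as the data becomes more off-policy."""
    return math.exp(-((ratio - 1.0) ** 2) / (2 * sigma ** 2))
```

A sample with ratio 1.0 keeps full weight; a badly off-policy sample with ratio 2.0 is effectively discarded rather than clipped to a hard boundary.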
AI Bullish · arXiv · CS AI · Mar 4 · 6/10
🧠 Researchers introduce RAPO (Retrieval-Augmented Policy Optimization), a new reinforcement learning framework that improves LLM agent training by incorporating retrieval mechanisms for broader exploration. The method achieves 5% performance gains across 14 datasets and 1.2x faster training by using hybrid-policy rollouts and retrieval-aware optimization.
AI Bullish · arXiv · CS AI · Mar 4 · 6/10
🧠 Researchers propose NAR-CP, a new method to improve Large Language Models' performance in high-frequency decision-making tasks like UAV pursuit. The approach uses normalized action rewards and consistency policy optimization to address limitations in current LLM-based agents that struggle with rapid, precise numerical state updates.
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 MIT researchers introduce VCPO (Variance Controlled Policy Optimization), a new method that improves asynchronous reinforcement learning for LLM training by addressing high variance issues in off-policy settings. The technique dynamically scales learning rates and applies variance control to achieve stable training with 2.5x speedup while maintaining performance.
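The "dynamically scales learning rates" part can be illustrated with a minimal controller that shrinks the step size whenever measured gradient variance exceeds a target. VCPO's actual controller is more sophisticated; this is only a sketch of the principle.

```python
def scaled_lr(base_lr, grad_variance, target_variance=1.0):
    """Variance-controlled learning-rate scaling: keep the base rate while
    gradient variance is within budget, and shrink it proportionally once
    off-policy variance spikes, trading step size for stability."""
    if grad_variance <= target_variance:
        return base_lr
    return base_lr * target_variance / grad_variance
```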
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers introduced Scaf-GRPO, a new training framework that overcomes the 'learning cliff' problem in LLM reasoning by providing strategic hints when models plateau. The method boosted Qwen2.5-Math-7B performance on the AIME24 benchmark by 44.3% relative to baseline GRPO methods.
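The scaffolding idea can be sketched as a prompt-level intervention: only when every sampled response to a problem is wrong (the "learning cliff", where group-relative advantages vanish) is a hint injected before resampling. The tiered-hint interface below is illustrative; Scaf-GRPO's hint schedule may differ.

```python
def scaffolded_prompt(problem, responses_correct, hints, level=0):
    """Return the training prompt for the next sampling round.

    If any response was correct, the group already carries learning
    signal and the problem is left untouched; otherwise a hint from
    the (hypothetical) tiered hint list is appended."""
    if any(responses_correct):
        return problem                      # signal exists; no hint needed
    level = min(level, len(hints) - 1)      # strongest hint as a fallback
    return f"{problem}\nHint: {hints[level]}"
```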
AI Neutral · arXiv · CS AI · 2d ago · 6/10
🧠 Researchers propose Policy Split, a novel reinforcement learning approach for LLMs that uses dual-mode entropy regularization to balance exploration with task accuracy. By bifurcating the policy into normal and high-entropy modes, the method enables diverse behavioral patterns while maintaining performance, showing improvements over existing entropy-guided RL baselines.
AI Neutral · arXiv · CS AI · 2d ago · 6/10
🧠 Researchers present a theoretical framework comparing entropy control methods in reinforcement learning for LLMs, showing that covariance-based regularization outperforms traditional entropy regularization by avoiding policy bias and achieving asymptotic unbiasedness. This analysis addresses a critical scaling challenge in RL-based LLM training where rapid policy entropy collapse limits model performance.
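The quantity at the heart of covariance-based entropy control is the covariance between a token's log-probability and its advantage: when that covariance is large and positive, the policy is sharpening fastest exactly where advantages are high, which drives entropy collapse. A minimal illustrative computation:

```python
def cov_logprob_advantage(logprobs, advantages):
    """Sample covariance (population-normalized) between per-token
    log-probabilities and advantages — the statistic that covariance-based
    regularizers penalize in place of a raw entropy bonus. Sketch only;
    the paper's estimator and normalization may differ."""
    n = len(logprobs)
    mean_l = sum(logprobs) / n
    mean_a = sum(advantages) / n
    return sum((l - mean_l) * (a - mean_a)
               for l, a in zip(logprobs, advantages)) / n
```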
AI Bullish · arXiv · CS AI · 2d ago · 6/10
🧠 Researchers introduce PODS (Policy Optimization with Down-Sampling), a technique that accelerates reinforcement learning training for large language models by selectively training on high-variance rollouts rather than all generated data. The method achieves equivalent performance to standard approaches at 1.7x faster speeds, addressing computational bottlenecks in LLM reasoning optimization.
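One way to "selectively train on high-variance rollouts" is to keep only the reward extremes of each batch, which maximizes reward variance within the retained set. This sketch captures the idea; the paper's selection rule may differ in detail.

```python
def downsample_rollouts(rollouts, rewards, keep=4):
    """Retain the rollouts at the extremes of the reward range: half the
    budget from the lowest-reward rollouts, half from the highest, so
    the kept subset has maximal reward variance (hence strong gradient
    signal) while the bulk of mid-reward rollouts is dropped."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    half = keep // 2
    kept = order[:half] + order[-half:]        # both ends of the reward range
    return [rollouts[i] for i in sorted(kept)] # preserve original batch order
```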
AI Neutral · arXiv · CS AI · 3d ago · 6/10
🧠 Researchers propose StaRPO, a reinforcement learning framework that improves large language model reasoning by incorporating stability metrics alongside task rewards. The method uses Autocorrelation Function and Path Efficiency measurements to evaluate logical coherence and goal-directedness, demonstrating improved accuracy and reasoning consistency across four benchmarks.
AI Neutral · arXiv · CS AI · 3d ago · 6/10
🧠 Researchers propose Visually-Guided Policy Optimization (VGPO), a framework that enhances vision-language models' ability to focus on visual information during reasoning tasks. The method addresses a fundamental limitation where text-dominated VLMs suffer from weak visual attention and temporal visual forgetting, improving performance on multimodal reasoning and visual-dependent tasks.
AI Neutral · arXiv · CS AI · 6d ago · 6/10
🧠 Researchers propose T-STAR, a novel reinforcement learning framework that structures multi-step agent trajectories as trees rather than independent chains, enabling better credit assignment for LLM agents. The method uses tree-based reward propagation and surgical policy optimization to improve reasoning performance across embodied, interactive, and planning tasks.
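The credit-assignment benefit of a tree over independent chains comes from reward propagation: a shared prefix inherits value from every branch it leads to, instead of being scored once per chain. A minimal sketch (mean-over-children backup; T-STAR's exact propagation rule may differ):

```python
def propagate_rewards(tree, leaf_rewards, gamma=1.0):
    """Build a value function over a trajectory tree.

    tree: dict mapping internal node -> list of child nodes.
    leaf_rewards: dict mapping leaf node -> terminal reward.
    An internal node's value is the discounted mean of its children,
    so shared prefixes aggregate credit from all their continuations."""
    def value(node):
        if node in leaf_rewards:
            return leaf_rewards[node]
        children = tree[node]
        return gamma * sum(value(c) for c in children) / len(children)
    return value
```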
AI Bullish · arXiv · CS AI · Mar 17 · 6/10
🧠 Researchers propose MA-VLCM, a framework that uses pretrained vision-language models as centralized critics in multi-agent reinforcement learning instead of learning critics from scratch. This approach significantly improves sample efficiency and enables zero-shot generalization while producing compact policies suitable for resource-constrained robots.
AI Bullish · arXiv · CS AI · Mar 16 · 6/10
🧠 Researchers introduce CRAFT-GUI, a curriculum learning framework that uses reinforcement learning to improve AI agents' performance in graphical user interface tasks. The method addresses difficulty variation across GUI tasks and provides more nuanced feedback, achieving 5.6% improvement on Android Control benchmarks and 10.3% on internal benchmarks.
AI Bullish · arXiv · CS AI · Mar 12 · 6/10
🧠 Researchers introduce CLIPO (Contrastive Learning in Policy Optimization), a new method that improves upon Reinforcement Learning with Verifiable Rewards (RLVR) for training Large Language Models. CLIPO addresses hallucination and answer-copying issues by incorporating contrastive learning to better capture correct reasoning patterns across multiple solution paths.
AI Bullish · arXiv · CS AI · Mar 6 · 6/10
🧠 Researchers propose EvoTool, a new framework that optimizes AI agent tool-use policies through evolutionary algorithms rather than traditional gradient-based methods. The system decomposes agent policies into four modules and uses blame attribution and targeted mutations to improve performance, showing over 5-point improvements on benchmarks.
AI Bullish · arXiv · CS AI · Mar 3 · 6/10
🧠 Researchers introduce InfoPO (Information-Driven Policy Optimization), a new method that improves AI agent interactions by using information-gain rewards to identify valuable conversation turns. The approach addresses credit assignment problems in multi-turn interactions and outperforms existing baselines across diverse tasks including intent clarification and collaborative coding.
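An information-gain reward for a conversation turn can be sketched as the entropy reduction in the agent's belief over user intent before versus after the turn. The belief distributions here are illustrative; the paper's reward construction may differ.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def info_gain_reward(belief_before, belief_after):
    """Reward a turn by how much it reduced uncertainty over intent:
    entropy(before) - entropy(after). A turn that fully resolves a
    50/50 ambiguity earns ln(2) nats; an uninformative turn earns 0."""
    return entropy(belief_before) - entropy(belief_after)
```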
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers propose MemPO (Self-Memory Policy Optimization), a new algorithm that enables AI agents to autonomously manage their memory during long-horizon tasks. The method achieves 25.98% F1 score gains over base models while reducing token usage by 67.58%.
AI Bullish · arXiv · CS AI · Mar 3 · 7/10
🧠 Researchers introduce HiMAC, a hierarchical reinforcement learning framework that improves LLM agent performance on long-horizon tasks by separating macro-level planning from micro-level execution. The approach demonstrates state-of-the-art results across multiple environments, showing that structured hierarchy is more effective than simply scaling model size for complex agent tasks.
AI Bullish · arXiv · CS AI · Mar 3 · 6/10
🧠 FlowPortrait is a new reinforcement learning framework that uses Multimodal Large Language Models for evaluation to generate more realistic talking-head videos with better lip synchronization. The system combines human-aligned assessment with policy optimization techniques to address persistent issues in audio-driven portrait animation.
AI Bullish · arXiv · CS AI · Mar 3 · 6/10
🧠 Researchers introduce In-Context Policy Optimization (ICPO), a new method that allows AI models to improve their responses during inference through multi-round self-reflection without parameter updates. The practical ME-ICPO algorithm demonstrates competitive performance on mathematical reasoning tasks while maintaining affordable inference costs.