y0news

#reward-optimization News & Analysis

5 articles tagged with #reward-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Jun 2 · 6/10

Value-Free Policy Optimization via Reward Partitioning

Researchers introduce Reward Partition Optimization (RPO), a new method for training language models that eliminates the need for value function estimation in preference-based learning. RPO simplifies the optimization process by normalizing rewards through partition-based formulations, demonstrating superior performance compared to existing approaches like DRO and KTO across multiple model architectures.

AI · Neutral · arXiv – CS AI · May 29 · 6/10

MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization

Researchers introduce MuPHI, a dataset and training framework for detecting implicit multimodal harm in image-text pairs where danger emerges from context-dependent reasoning rather than surface features. The proposed MuPHIRM framework uses reward optimization to improve vision-language models' ability to reason about compositional harm while demonstrating stronger generalization to out-of-distribution scenarios.

AI · Neutral · arXiv – CS AI · May 28 · 6/10

Mechanistically Interpreting the Role of Sample Difficulty in RLVR for LLMs

Researchers mechanistically analyze how sample difficulty affects Reinforcement Learning with Verifiable Reward (RLVR) training in large language models, discovering that medium-difficulty problems yield optimal reasoning improvements while overly hard problems degrade performance. The study proposes difficulty-adaptive strategies using backward-reasoning reformulation and sparse autoencoders to optimize reward signals during training.

AI · Bullish · arXiv – CS AI · May 11 · 6/10

BalCapRL: A Balanced Framework for RL-Based MLLM Image Captioning

Researchers introduce BalCapRL, a reinforcement learning framework that improves multimodal image captioning by balancing three competing objectives: utility-aware correctness, reference coverage, and linguistic quality. The method achieves significant performance gains across multiple models by applying reward-decoupled normalization and length-conditional masking, addressing the trade-offs present in existing captioning approaches.

AI · Bullish · arXiv – CS AI · Mar 3 · 6/10 · 4

Iterative Distillation for Reward-Guided Fine-Tuning of Diffusion Models in Biomolecular Design

Researchers propose an iterative distillation framework for fine-tuning diffusion models in biomolecular design to optimize specific reward functions. The method addresses the stability and efficiency issues of existing reinforcement learning approaches by collecting data off-policy and minimizing a KL divergence.