#preference-optimization News & Analysis

23 articles tagged with #preference-optimization. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

23 articles

AIBullisharXiv – CS AI · 2d ago7/10

🧠

LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback

Researchers introduce LLUMI, an open-source LLM system for mental health support that uses community feedback from Reddit to improve response quality without relying on proprietary cloud models. The approach achieves comparable performance to GPT models while offering better privacy protection for sensitive health contexts.

AIBullisharXiv – CS AI · May 117/10

🧠

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

Researchers introduce CASPO, a framework that improves reasoning reliability in large language models by aligning token-level confidence with step-wise logical correctness through preference optimization. The method achieves better performance than tree-search approaches without requiring separate reward models, while introducing CaT inference that dynamically prunes uncertain reasoning branches with minimal computational overhead.

AIBullisharXiv – CS AI · May 77/10

🧠

RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization

Researchers introduce RLearner-LLM, a hybrid optimization method that combines NLI (Natural Language Inference) signals with LLM verification to address a critical flaw in Direct Preference Optimization: the tendency to reward verbose but logically incorrect outputs. The approach achieves up to 6x improvement in logical consistency across academic domains while maintaining inference speed, demonstrating that logic-aware metrics outperform traditional LLM-based evaluation for knowledge-intensive tasks.

🧠 GPT-4

AIBullisharXiv – CS AI · Mar 57/10

🧠

SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

Researchers have developed SafeDPO, a simplified approach to training large language models that balances helpfulness and safety without requiring complex multi-stage systems. The method uses only preference data and safety indicators, achieving competitive safety-helpfulness trade-offs while eliminating the need for reward models and online sampling.

AIBullisharXiv – CS AI · Mar 37/102

🧠

Towards Safe Reasoning in Large Reasoning Models via Corrective Intervention

Researchers propose Intervened Preference Optimization (IPO) to address safety issues in Large Reasoning Models, where chain-of-thought reasoning contains harmful content even when final responses appear safe. The method achieves over 30% reduction in harmfulness while maintaining reasoning performance.

AIBullisharXiv – CS AI · Feb 277/106

🧠

Dual-IPO: Dual-Iterative Preference Optimization for Text-to-Video Generation

Researchers introduce Dual-Iterative Preference Optimization (Dual-IPO), a new method that iteratively improves both reward models and video generation models to create higher-quality AI-generated videos better aligned with human preferences. The approach enables smaller 2B parameter models to outperform larger 5B models without requiring manual preference annotations.

AIBullisharXiv – CS AI · 3d ago6/10

🧠

Reasoning Matters: Mitigate Hallucination in Multimodal Large Reasoning Models via Reasoning-Conditioned Preference Optimization

Researchers propose Reasoning-Conditioned Direct Preference Optimization (RC-DPO), a training method that reduces hallucinations in multimodal large reasoning models by treating chain-of-thought reasoning as a condition for answer generation rather than a monolithic output. The approach uses Monte Carlo Tree Search to generate better training data and demonstrates improved reliability across multiple benchmarks.

AINeutralarXiv – CS AI · 4d ago6/10

🧠

Advancing Creative Physical Intelligence in Large Multimodal Models

Researchers introduce MM-CreativityBench, a benchmark testing whether large multimodal models can solve creative physical problems by identifying non-obvious tool uses in constrained environments. Current LMMs struggle not from lack of generation capability but from poor visual grounding, hallucinating attributes and overlooking relevant entities; the team proposes affordance-grounded alignment using preference learning to improve performance.

AINeutralarXiv – CS AI · May 126/10

🧠

EvoPref: Multi-Objective Evolutionary Optimization Discovers Diverse LLM Alignments Beyond Gradient Descent

Researchers introduce EvoPref, a multi-objective evolutionary algorithm that optimizes LLM alignment across multiple objectives using population-based methods rather than traditional gradient descent. The approach demonstrates 18% improvement in preference coverage and 47% reduction in preference collapse while maintaining competitive alignment quality compared to gradient-based methods like ORPO.

AINeutralarXiv – CS AI · May 116/10

🧠

Implicit Preference Alignment for Human Image Animation

Researchers propose Implicit Preference Alignment (IPA), a machine learning framework that improves hand motion generation in human image animation without requiring expensive paired preference data. The method uses self-generated samples and a hand-aware optimization mechanism to enhance animation quality while reducing data curation overhead.

AINeutralarXiv – CS AI · May 116/10

🧠

Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph

Researchers introduce Graph Direct Preference Optimization (GraphDPO), an advancement over standard DPO that leverages full preference structures from multiple rollouts per prompt rather than collapsing data into independent pairs. The method maintains computational efficiency while improving stability and performance on reasoning and program synthesis tasks by enforcing transitivity and reducing conflicting supervision signals.

AINeutralarXiv – CS AI · May 116/10

🧠

MaPPO: Maximum a Posteriori Preference Optimization with Prior Knowledge

Researchers introduce MaPPO, a new preference optimization method for large language models that integrates prior reward knowledge into the training objective. Building on Direct Preference Optimization (DPO), MaPPO demonstrates consistent improvements across multiple benchmarks while maintaining computational efficiency and compatibility with existing DPO variants.

AIBullisharXiv – CS AI · Apr 206/10

🧠

FSPO: Few-Shot Optimization of Synthetic Preferences Personalizes to Real Users

Researchers propose FSPO (Few-Shot Preference Optimization), a meta-learning algorithm that personalizes large language models using minimal user preference data. The approach uses synthetically generated preferences to train models that can quickly adapt to individual user preferences, achieving 87% performance on synthetic users and 70% on real human users in evaluation tasks.

AINeutralarXiv – CS AI · Apr 206/10

🧠

CLewR: Curriculum Learning with Restarts for Machine Translation Preference Learning

Researchers introduce CLewR, a curriculum learning strategy that improves machine translation performance in large language models by reordering training data from easy to hard examples with periodic restarts. The approach demonstrates consistent improvements across multiple model families and preference optimization techniques, addressing a previously underexplored aspect of LLM training methodology.

🧠 Llama

AIBullisharXiv – CS AI · Apr 156/10

🧠

GoodPoint: Learning Constructive Scientific Paper Feedback from Author Responses

Researchers introduce GoodPoint, an AI system trained to generate constructive scientific feedback by learning from author responses to peer review. The method improves feedback quality by 83.7% over baseline models and outperforms larger LLMs like Gemini-3-flash, demonstrating that specialized training on valid, actionable feedback signals yields better results than general-purpose models.

🧠 Gemini

AINeutralarXiv – CS AI · Apr 146/10

🧠

FinTrace: Holistic Trajectory-Level Evaluation of LLM Tool Calling for Long-Horizon Financial Tasks

Researchers introduced FinTrace, a benchmark dataset with 800 expert-annotated trajectories for evaluating how large language models perform financial tool-calling tasks. The study reveals that while frontier LLMs excel at selecting appropriate tools, they struggle significantly with information utilization and generating accurate final outputs, pointing to a critical reasoning gap that persists even after fine-tuning with preference optimization techniques.

AIBullisharXiv – CS AI · Apr 146/10

🧠

SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

Researchers propose SVSR, a self-verification and self-rectification framework that enhances multimodal AI reasoning through a three-stage training approach combining preference datasets, supervised fine-tuning, and semi-online direct preference optimization. The method demonstrates improved accuracy and generalization across visual understanding tasks while maintaining performance even without explicit reasoning traces.

AIBullisharXiv – CS AI · Apr 146/10

🧠

Mobile GUI Agent Privacy Personalization with Trajectory Induced Preference Optimization

Researchers propose Trajectory Induced Preference Optimization (TIPO), a novel method for training mobile GUI agents to respect user privacy preferences while maintaining task execution capability. The approach addresses the challenge that privacy-conscious users generate structurally different execution patterns than utility-focused users, requiring specialized optimization techniques to properly align agent behavior with individual privacy preferences.

AIBullisharXiv – CS AI · Mar 36/104

🧠

Solving the Granularity Mismatch: Hierarchical Preference Learning for Long-Horizon LLM Agents

Researchers introduce Hierarchical Preference Learning (HPL), a new framework that improves AI agent training by using preference signals at multiple granularities - trajectory, group, and step levels. The method addresses limitations in existing Direct Preference Optimization approaches and demonstrates superior performance on challenging agent benchmarks through a dual-layer curriculum learning system.

AIBullisharXiv – CS AI · Mar 36/104

🧠

When Data is the Algorithm: A Systematic Study and Curation of Preference Optimization Datasets

Researchers conducted the first comprehensive analysis of open-source direct preference optimization (DPO) datasets used to align large language models, revealing significant quality variations. They created UltraMix, a curated dataset that's 30% smaller than existing options while delivering superior performance across benchmarks.

AIBullisharXiv – CS AI · Mar 27/1014

🧠

Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

Researchers propose MetaAPO, a new framework for aligning large language models with human preferences that dynamically balances online and offline training data. The method uses a meta-learner to evaluate when on-policy sampling is beneficial, resulting in better performance while reducing online annotation costs by 42%.

AINeutralarXiv – CS AI · Apr 105/10

🧠

Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search

Researchers introduce MSPA-CQR, a machine learning approach that improves conversational query rewriting by aligning preferences across three dimensions: query rewriting, passage retrieval, and response generation. The method uses self-consistent preference data and direct preference optimization to generate more diverse and effective rewritten queries in conversational search systems.

AINeutralHugging Face Blog · Jul 104/107

🧠

Preference Optimization for Vision Language Models

The article title indicates a focus on preference optimization techniques for Vision Language Models, which are AI systems that process both visual and textual information. This represents ongoing research in improving how these multimodal AI models align with human preferences and perform tasks.