Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries
Researchers present MO-PQUCB, a novel algorithm for personalized multi-objective decision-making that combines conversational queries with bandit feedback to learn user preferences more efficiently. The method uses a Plackett-Luce choice model and shift-invariant regularization to overcome fundamental learning barriers, demonstrating improved regret scaling and robustness to corrupted preference signals compared to existing approaches.
This arXiv paper addresses a theoretical challenge in personalized recommendation: a system must balance multiple competing objectives while also learning each user's preferences over them. Traditional multi-objective bandit algorithms treat preference learning passively, inferring priorities only from user feedback on recommended items. The authors observe that real-world interactions provide richer signals, because users naturally articulate their trade-offs in conversational language, such as 'affordable and clean' when searching for hotels or flights. By formalizing these proactive queries within a mathematical framework, the paper argues that structured preference signals can accelerate learning and improve decision quality.
The core innovation lies in MO-PQUCB's hybrid design, which integrates query-based preference anchoring with the exploration-exploitation trade-off through shift-invariant regularization. This addresses a fundamental identifiability barrier: query data alone cannot uniquely determine preferences, since responses under the choice model are unchanged when every preference score is shifted by the same constant. The algorithm therefore combines information from both conversational signals and implicit bandit feedback, creating a more robust learning mechanism. The authors provide theoretical regret bounds that scale better than those of existing preference-aware multi-armed bandit methods.
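To make the shift-invariance barrier concrete, the following minimal sketch shows that a Plackett-Luce model assigns identical probabilities to two score vectors that differ only by a constant. The score vector `theta`, the shift of `3.0`, and the helper names are hypothetical and chosen for illustration; the `center` function shows one common identifiability convention (sum-to-zero scores) and is not claimed to be the paper's regularizer.

```python
import math

def plackett_luce_top1(utilities):
    """Probability that each item is chosen first under a Plackett-Luce model."""
    m = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(u - m) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def plackett_luce_ranking_prob(utilities, ranking):
    """Probability of a full ranking, given as item indices ordered best first."""
    remaining = list(ranking)
    prob = 1.0
    while remaining:
        m = max(utilities[i] for i in remaining)
        weights = {i: math.exp(utilities[i] - m) for i in remaining}
        # The top remaining item is chosen with probability proportional to its weight.
        prob *= weights[remaining[0]] / sum(weights.values())
        remaining.pop(0)
    return prob

def center(utilities):
    """Remove the unidentifiable constant by forcing scores to sum to zero.

    Assumption: this is a generic convention, not the paper's method.
    """
    mean = sum(utilities) / len(utilities)
    return [u - mean for u in utilities]

# Hypothetical preference scores over three objectives, and a shifted copy.
theta = [0.2, 1.0, -0.5]
shifted = [t + 3.0 for t in theta]

p = plackett_luce_top1(theta)
p_shifted = plackett_luce_top1(shifted)

# Choice probabilities agree, so query responses cannot tell the two vectors apart.
shift_invariant = all(math.isclose(a, b, rel_tol=1e-9) for a, b in zip(p, p_shifted))
print("same choice probabilities:", shift_invariant)

# After centering, both vectors map to the same representative.
print("centered theta:  ", [round(c, 6) for c in center(theta)])
print("centered shifted:", [round(c, 6) for c in center(shifted)])
```

Because only score differences enter the model, some additional anchor is needed to fix the overall level of the scores. In the paper's setting that role is played by combining query data with bandit feedback, rather than by the simple centering used here.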
For practical deployment, the framework extends beyond idealized settings. The paper characterizes performance under corrupted queries—reflecting real-world noise in user communication—and develops estimators that maintain near-optimal guarantees when corruption is sparse. This robustness makes the approach viable for production systems where preference signals may be incomplete or misleading. The theoretical contributions establish fundamental limits on preference learning from corrupted data, providing guidance for system design. Experimental validation confirms both the theoretical predictions and practical utility of the hybrid approach.
- Proactive conversational queries provide structured preference signals that can accelerate learning in multi-objective personalization systems.
- MO-PQUCB resolves a shift-invariance barrier by combining query-based anchoring with bandit feedback through dual-exploration mechanisms.
- The algorithm achieves improved regret scaling compared to preference-aware multi-armed bandit baselines in theoretical and empirical settings.
- The framework includes robust estimation techniques that maintain near-optimal performance under sparse corruption of user preference signals.
- The research bridges the gap between academic multi-objective bandits and practical personalized recommendation systems that use conversational interfaces.