AINeutralarXiv – CS AI · 15h ago6/10
🧠
Linear and Neural Dueling Bandits with Delayed Feedback
Researchers propose novel algorithms (LDB-DF and NDB-DF) for contextual dueling bandits that handle delayed feedback—a critical real-world constraint in recommender systems and LLM alignment. The breakthrough involves an Inverse Probability Weighting mechanism that eliminates bias from delayed observations, achieving theoretical regret bounds of O(d√T) for linear settings.