#contextual-bandits News & Analysis

14 articles tagged with #contextual-bandits. AI-curated summaries with sentiment analysis and key takeaways from 50+ sources.

AI · Neutral · arXiv – CS AI · Apr 6 · 7/10

Jump Start or False Start? A Theoretical and Empirical Evaluation of LLM-initialized Bandits

Research examines how Large Language Models can be used to initialize contextual bandits for recommendation systems, finding that LLM-generated preferences remain effective up to 30% data corruption but can harm performance beyond 50% corruption. The study provides theoretical analysis showing when LLM warm-starts outperform cold-start approaches, with implications for AI-driven recommendation systems.

AI · Bullish · arXiv – CS AI · Feb 27 · 7/10 · 9

Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits

Researchers tighten sample complexity bounds for offline policy learning with f-divergence regularization in contextual bandits. The study demonstrates optimal O(ε⁻¹) sample complexity under single-policy concentrability conditions, significantly improving upon existing bounds.

AI · Neutral · arXiv – CS AI · Feb 27 · 7/10 · 7

Learning to Answer from Correct Demonstrations

Researchers propose a new approach for training AI models to generate correct answers from demonstrations, using imitation learning in contextual bandits rather than traditional supervised fine-tuning. The method achieves better sample complexity and works with weaker assumptions about the underlying reward model compared to existing likelihood-maximization approaches.

AI · Bullish · arXiv – CS AI · 3d ago · 6/10

OrcaRouter: A Production-Oriented LLM Router with Hybrid Offline-Online Learning

OrcaRouter is a production-ready LLM routing system that uses contextual bandits and hybrid offline-online learning to direct each request to the most appropriate language model. The system ranked second on the RouterArena leaderboard with 75.54% accuracy while keeping inference cost low at $1.00 per 1,000 queries.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

Researchers introduce a multi-agent framework that combines contextual bandits with semantic checkpoints to prevent 'semantic drift' in automated scientific computing workflows. The system ensures that computational strategies selected by AI agents are faithfully executed and remain causally attributable throughout multi-agent pipelines, improving convergence and robustness in adaptive decision-making.

AI · Neutral · arXiv – CS AI · 6d ago · 6/10

The Sample Complexity of Multiclass and Sparse Contextual Bandits

Researchers present optimal algorithms for sparse contextual bandits that achieve sample complexity of Õ((s/ε² + |A|/ε)log|Π|/δ), closing a gap from prior work that had exponential dependence on action set size. The results apply to multiclass classification and combinatorial semi-bandits through information-theoretic and algorithmic approaches.

AI · Neutral · arXiv – CS AI · May 28 · 5/10

Learning to Assign Prediction Tasks to Agents with Capacity Constraints

Researchers propose a machine learning framework for optimally assigning prediction tasks to heterogeneous agents (humans or AI systems) subject to capacity constraints. The work develops explore-exploit algorithms that learn agent expertise and adapt assignments dynamically, demonstrating improvements over baseline approaches across tabular, image, and text tasks.

AI · Neutral · arXiv – CS AI · May 27 · 6/10

Linear and Neural Dueling Bandits with Delayed Feedback

Researchers propose two algorithms (LDB-DF and NDB-DF) for contextual dueling bandits that handle delayed feedback, a critical real-world constraint in recommender systems and LLM alignment. The key component is an Inverse Probability Weighting mechanism that eliminates bias from delayed observations, achieving theoretical regret bounds of O(d√T) for linear settings.

AI · Neutral · arXiv – CS AI · May 12 · 6/10

Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability

Researchers achieve the first fast statistical rates (Õ(ε⁻¹)) for offline contextual bandits using forward-KL regularization under single-policy concentrability. The result matches rates previously shown only for reverse-KL approaches, and the paper establishes lower bounds showing the rate is optimal.

AI · Neutral · arXiv – CS AI · May 1 · 6/10

Learning When to Remember: Risk-Sensitive Contextual Bandits for Abstention-Aware Memory Retrieval in LLM-Based Coding Agents

Researchers introduce RSCB-MC, a risk-sensitive contextual bandit system that improves how LLM-based coding agents decide whether to use external memory for debugging tasks. Rather than treating memory retrieval as a simple similarity-matching problem, the system treats it as a safety-critical control problem, achieving a 62.5% success rate with zero false positives in testing.

AI · Bullish · arXiv – CS AI · Mar 5 · 5/10

Online Learning for Multi-Layer Hierarchical Inference under Partial and Policy-Dependent Feedback

Researchers developed a new variance-reduced EXP4-based algorithm for optimizing routing policies in multi-layer hierarchical inference systems. The solution addresses the challenge of sparse, policy-dependent feedback in AI systems where prediction errors are only revealed at terminal layers, improving stability and performance over standard importance-weighted approaches.

AI · Bearish · arXiv – CS AI · Mar 3 · 7/10 · 6

Learning to Attack: A Bandit Approach to Adversarial Context Poisoning

Researchers developed AdvBandit, a new black-box adversarial attack method that can exploit neural contextual bandits by poisoning context data without requiring access to internal model parameters. The attack uses bandit theory and inverse reinforcement learning to adaptively learn victim policies and optimize perturbations, achieving higher victim regret than existing methods.

AI · Neutral · arXiv – CS AI · Mar 17 · 4/10

Learning When to Trust in Contextual Bandits

Researchers propose CESA-LinUCB, a new approach to robust reinforcement learning that addresses 'Contextual Sycophancy', in which evaluators are truthful in normal situations but biased in critical contexts. The method learns trust boundaries for each evaluator and achieves sublinear regret even when no evaluator is globally reliable.

AI · Neutral · arXiv – CS AI · Mar 9 · 4/10

Structured Exploration vs. Generative Flexibility: A Field Study Comparing Bandit and LLM Architectures for Personalised Health Behaviour Interventions

A 4-week study comparing bandit algorithms and LLM architectures for personalized health behavior interventions found that LLM-based messaging approaches were rated more helpful than templates, but contextual bandit optimization provided no additional benefit over LLM-only methods. The research reveals a trade-off between structured exploration of behavior change techniques and generative flexibility in AI health systems.