
Multi-Armed Bandits With Best-Action Queries

arXiv – CS AI | Francesco Bacchiocchi, Matteo Castiglioni, Alberto Marchesi, Francesco Emanuele Stradi
🤖 AI Summary

Researchers resolve an open problem in multi-armed bandit theory by characterizing how best-action oracle queries improve learning algorithms in the realistic bandit-feedback model. They prove that the benefit depends critically on reward structure: with correlated stochastic rewards the learner cannot achieve the gains seen in full-feedback settings, while with i.i.d. stochastic rewards near-optimal improvements survive, up to logarithmic factors.

Analysis

This theoretical computer science paper advances machine learning optimization by providing a complete characterization of oracle-augmented bandits in practical settings. The researchers definitively answer whether insights from Russo et al.'s full-feedback work transfer to bandit feedback, where only the selected arm's reward is observed, a restriction that mirrors real-world conditions in recommendation systems, clinical trials, and adaptive control problems.

The key finding reveals a sharp distinction based on reward correlation structure. When arm rewards are correlated, the learner cannot exploit best-action queries to achieve regret below Ω(√(T-k)), regardless of how many queries are available. This negative result extends to adversarial environments, establishing that correlation creates irreducible uncertainty. Conversely, with i.i.d. stochastic rewards, the optimal Õ(min{T/k, √(T-k)}) regret remains achievable, matching information-theoretic lower bounds.
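
To make the rates concrete, here is a minimal arithmetic sketch (not the paper's algorithm) that evaluates the summarized bounds for an illustrative horizon, dropping constants and logarithmic factors and using the Õ(min{T/k, √T}) form stated in the takeaways below:

    import math

    T = 1_000_000  # illustrative horizon, not a value taken from the paper

    for k in [1, 10, 100, 1_000, 10_000, 100_000]:
        no_query = math.sqrt(T)          # O(sqrt(T)): standard bandit rate, zero queries
        iid = min(T / k, math.sqrt(T))   # O(min{T/k, sqrt(T)}): i.i.d. rewards with k queries
        corr_floor = math.sqrt(T - k)    # Omega(sqrt(T-k)): lower bound under correlated rewards
        print(f"k={k:>7}  no-query={no_query:8.0f}  iid={iid:8.0f}  correlated-floor={corr_floor:8.0f}")

Since T/k < √T exactly when k > √T, queries leave the i.i.d. rate unchanged until their number exceeds √T, while the correlated floor √(T-k) barely moves until k approaches the horizon itself.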

For the broader machine learning community, this work clarifies when expensive oracle access justifies its cost. Practitioners building decision-making systems can now assess whether the reward dependencies in their domain leave room for query benefits or call for alternative approaches. The matching upper and lower bounds provide theoretical closure, eliminating gaps that previously left algorithm design ambiguous.

The results have implications for algorithm design in online learning platforms, where oracle queries might represent expert consultation or additional computational resources. The correlation-dependent characterization suggests practitioners must carefully evaluate their specific reward structures before investing in oracle infrastructure.
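
As a purely illustrative back-of-envelope sketch (an assumption of this summary, not a model from the paper), suppose each best-action query carries a hypothetical cost c expressed in the same units as regret. Taking the i.i.d. bound at face value, the trade-off between query spend and regret looks like this:

    import math

    def total_cost(T: float, k: int, c: float) -> float:
        # Toy objective: hypothetical per-query cost c times k queries,
        # plus the O(min{T/k, sqrt(T)}) i.i.d. regret rate with constants dropped.
        return c * k + min(T / k, math.sqrt(T))

    T, c = 1_000_000, 0.5  # assumed values chosen for illustration only
    # Scan power-of-two query budgets: once k exceeds sqrt(T), the optimum
    # balances c*k against T/k, i.e. roughly k* = sqrt(T/c).
    best_k = min((2 ** i for i in range(21)), key=lambda k: total_cost(T, k, c))
    print(best_k, round(total_cost(T, best_k, c)))

Repeating the exercise with the correlated-rewards floor Ω(√(T-k)) in place of the i.i.d. rate, the query term c·k buys essentially nothing and the toy model recommends k = 0, echoing the point that reward structure should be checked before investing in oracle infrastructure.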

Key Takeaways
  • Best-action queries reduce bandit regret from Õ(√T) to Õ(min{T/k, √T}) only when rewards are i.i.d., not universally across all stochastic settings
  • Correlated rewards prevent improvement below Ω(√(T-k)) regret regardless of query count, making oracle access ineffective in dependent environments
  • The gap between full-feedback and bandit-feedback models reveals information-theoretic limits that cannot be overcome through algorithm design alone
  • Matching upper and lower bounds provide complete theoretical characterization, eliminating ambiguity in bandit-oracle algorithm analysis
  • Results apply to both stochastic and adversarial reward settings, establishing broad applicability across different learning environment assumptions