Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk
Researchers present Nested Causal Thompson Sampling (NCTS), a machine learning framework for sequential decision-making in which strategic choices causally influence subsequent tactical decisions across multiple timescales. The work introduces PAC-Bayesian risk bounds that certify deployment policies off-policy, from historical data alone, which supports safer handover from legacy systems to learned agents.
This research addresses a fundamental gap in reinforcement learning and bandit theory: most frameworks assume decisions occur on a single timescale, but real-world systems involve hierarchical causality where high-level strategic choices reshape the context for lower-level tactical decisions. The Nested Contextual Causal Bandits framework formalizes this structure using structural causal models, enabling more realistic problem representation than standard approaches.
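To make the nested structure concrete, here is a toy sketch in Python. It is not the paper's model: the function name `sample_episode`, the Gaussian context, and the 0/1 reward are all illustrative assumptions. The only property it is meant to show is that the strategic action changes the distribution of contexts the tactical policy later sees.

```python
import random

def sample_episode(strategic_action, tactical_policy, n_inner=5, rng=None):
    """Simulate one strategic decision followed by n_inner tactical decisions.

    Toy assumption: the strategic action (0 or 1) shifts the mean of the
    tactical context, so it causally reshapes what the inner policy faces.
    """
    rng = rng or random.Random(0)
    context_mean = 1.0 if strategic_action == 1 else -1.0
    total_reward = 0.0
    for _ in range(n_inner):
        # Tactical context is drawn downstream of the strategic choice.
        context = rng.gauss(context_mean, 1.0)
        action = tactical_policy(context)
        # Toy reward: 1 when the tactical action matches the sign of the context.
        total_reward += 1.0 if (action == 1) == (context > 0) else 0.0
    return total_reward / n_inner

# A fixed tactical policy earns more or less depending on the strategic choice.
always_one = lambda context: 1
rng = random.Random(0)
high = sample_episode(1, always_one, n_inner=2000, rng=rng)
low = sample_episode(0, always_one, n_inner=2000, rng=rng)
print(high > low)
```

In this sketch the same tactical policy performs very differently under the two strategic actions, which is the dependency a single-timescale formulation would miss.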
The theoretical contribution centers on providing certified risk guarantees without requiring future online interaction. The PAC-Bayesian bound allows practitioners to evaluate whether a candidate policy is trustworthy for deployment based solely on historical data, addressing a critical bottleneck in deploying learned systems in high-stakes domains. This "anytime" certification capability is particularly valuable for industries like healthcare, finance, and autonomous systems where deployment decisions cannot wait for convergence guarantees.
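As a rough illustration of how such a certificate can be computed, the sketch below evaluates a generic McAllester/Maurer-style PAC-Bayes bound. This is a textbook form, not the bound derived in the paper: it assumes i.i.d. losses bounded in [0, 1] and ignores the importance weighting an off-policy setting would normally require. The names `pac_bayes_bound` and `certify` are hypothetical.

```python
import math

def pac_bayes_bound(empirical_risk, kl, n, delta=0.05):
    """Upper bound on expected risk that holds with probability >= 1 - delta.

    Generic form (assumption, not the paper's bound):
        risk <= empirical_risk + sqrt((KL + ln(2*sqrt(n)/delta)) / (2n))
    where KL is the divergence between the posterior and the prior over
    policies, and n is the number of logged samples.
    """
    if not 0.0 <= empirical_risk <= 1.0:
        raise ValueError("empirical_risk must lie in [0, 1]")
    if kl < 0 or n < 1 or not 0.0 < delta < 1.0:
        raise ValueError("need kl >= 0, n >= 1 and 0 < delta < 1")
    slack = math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))
    return min(1.0, empirical_risk + slack)

def certify(empirical_risk, kl, n, risk_threshold, delta=0.05):
    """Certify a candidate policy if its risk bound is below the threshold."""
    return pac_bayes_bound(empirical_risk, kl, n, delta) <= risk_threshold

# More historical data tightens the bound for the same empirical risk.
print(certify(0.1, kl=2.0, n=10_000, risk_threshold=0.15))
print(certify(0.1, kl=2.0, n=100, risk_threshold=0.15))
```

The check uses only logged quantities (empirical risk, sample count, and the posterior-to-prior divergence), which is what makes certification from historical data possible without further online interaction.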
The experiments report two practical advantages. First, the mechanism-factorized approach transfers substantially better under distribution shift than joint regression baselines, a critical property for real-world deployment. Second, the recursive meta-to-inner commitment strategy outperforms joint-commitment alternatives, suggesting that hierarchical decision structure genuinely improves performance when it is properly exploited.
The progressive certified handover framework enables gradual, independently timed transitions from legacy controllers to learned policies at each timescale level, reducing deployment risk. This staged approach provides natural checkpoints at which certification thresholds trigger human-validated transitions, rather than requiring simultaneous system-wide replacement. For industries managing complex sequential decisions under safety constraints, this work establishes a principled methodology for trustworthy agent deployment.
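The staged handover logic can be sketched as a per-level gate. This is a minimal illustration under stated assumptions, not the paper's procedure: the `LevelStatus` fields, the level names, and the rule "certified risk bound below threshold and human sign-off" are hypothetical stand-ins for whatever criteria a real deployment would use.

```python
from dataclasses import dataclass

@dataclass
class LevelStatus:
    name: str                     # timescale level, e.g. "strategic" or "tactical"
    risk_bound: float             # certified upper bound on the learned policy's risk
    threshold: float              # maximum acceptable risk at this level
    human_approved: bool = False  # sign-off recorded at the checkpoint

def handover_decisions(levels):
    """Decide, independently per level, which controller should be active."""
    decisions = {}
    for level in levels:
        certified = level.risk_bound <= level.threshold
        # The learned policy takes over only when certified AND approved;
        # otherwise the legacy controller stays in place at that level.
        if certified and level.human_approved:
            decisions[level.name] = "learned"
        else:
            decisions[level.name] = "legacy"
    return decisions

levels = [
    LevelStatus("strategic", risk_bound=0.30, threshold=0.15, human_approved=True),
    LevelStatus("tactical", risk_bound=0.12, threshold=0.15, human_approved=True),
]
print(handover_decisions(levels))
```

Because each level is evaluated on its own, the tactical level in this example can switch to the learned policy while the strategic level remains on the legacy controller until its bound tightens.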
- NCTS provides certified off-policy risk bounds, enabling safe deployment decisions from historical data without online interaction
- Mechanism-factorized posteriors significantly outperform joint regression baselines under distribution shift
- Recursive meta-to-inner commitment outperforms joint-commitment alternatives
- Progressive certified handover enables independent, staged transitions from legacy to learned policies across decision timescales
- The framework applies to critical sequential decision domains, including healthcare, finance, and autonomous systems