Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization
Researchers propose a framework for incorporating Large Language Model (LLM) priors into multi-objective Bayesian optimization while maintaining robustness against miscalibrated advice. Using an objective-wise reputation mechanism and counterfactual gating, the approach dynamically adjusts trust in LLM suggestions based on observed performance rather than accepting them blindly, with empirical validation across molecular optimization tasks.
This research addresses a critical gap in AI-assisted optimization: the assumption that LLM confidence correlates with useful guidance. The study reveals that LLM self-reported confidence often misleads decision-making systems, particularly in multi-objective scenarios where domain expertise varies across different optimization targets. The proposed reputation-market mechanism treats each LLM-objective pairing as an independently verifiable prediction source, updating trust weights dynamically as real-world feedback accumulates.
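One way to picture the objective-wise reputation idea is a small ledger that keeps a separate trust score per objective and adjusts it after each real evaluation. The sketch below is illustrative only: the class name `ReputationLedger`, the log-odds update, the learning rate `eta`, and the use of a baseline surrogate's error as the comparison point are all assumptions, not the update rule used in the study.

```python
import math
from dataclasses import dataclass, field


@dataclass
class ReputationLedger:
    """Hypothetical per-objective trust tracker for one LLM."""
    objectives: list
    eta: float = 1.0  # assumed learning rate for the log-odds update
    log_odds: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every objective starts at log-odds 0, i.e. a trust of 0.5.
        for name in self.objectives:
            self.log_odds.setdefault(name, 0.0)

    def update(self, objective, llm_error, baseline_error):
        # Positive advantage: the LLM prior was closer to the observed
        # value than the baseline surrogate was on this objective.
        advantage = baseline_error - llm_error
        self.log_odds[objective] += self.eta * advantage

    def trust(self, objective):
        # Clamp before the sigmoid to avoid overflow in math.exp.
        z = max(-30.0, min(30.0, self.log_odds[objective]))
        return 1.0 / (1.0 + math.exp(-z))


# Toy usage with made-up errors (not benchmark results).
ledger = ReputationLedger(["solubility", "lipophilicity"])
ledger.update("solubility", llm_error=0.9, baseline_error=0.4)     # LLM worse
ledger.update("lipophilicity", llm_error=0.2, baseline_error=0.6)  # LLM better
```

Because each objective has its own entry, poor advice on one target lowers trust only for that target and leaves the others untouched.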
The work builds on growing recognition that LLMs excel at heuristic suggestion yet fail at calibration. Prior approaches treated LLM guidance as static inputs; this framework introduces adaptive filtering mechanisms that can selectively embrace, conditionally trust, or ignore LLM recommendations. The counterfactual gating system provides three operational modes, enabling fine-grained control over when and how extensively LLM priors influence optimization decisions.
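The summary does not spell out the three gating modes, so the following is a minimal sketch under stated assumptions. It reads "embrace, conditionally trust, or ignore" as three modes, and it assumes the gate compares a retrospective score obtained with the LLM prior against the score obtained without it (higher is better). The names `GateMode`, `choose_mode`, `prior_weight`, `blend`, and the parameters `margin` and `max_weight` are hypothetical.

```python
from enum import Enum


class GateMode(Enum):
    EMBRACE = "embrace"          # use the LLM prior at its full (capped) weight
    CONDITIONAL = "conditional"  # scale the prior by its current trust score
    IGNORE = "ignore"            # drop the prior for this objective


def choose_mode(score_with_prior, score_without_prior, margin=0.05):
    """Pick a mode from a with/without comparison (assumed gating rule)."""
    delta = score_with_prior - score_without_prior
    if delta > margin:
        return GateMode.EMBRACE
    if delta < -margin:
        return GateMode.IGNORE
    return GateMode.CONDITIONAL


def prior_weight(mode, trust, max_weight=0.5):
    """Map a mode and a trust score in [0, 1] to a mixing weight."""
    if mode is GateMode.EMBRACE:
        return max_weight
    if mode is GateMode.CONDITIONAL:
        return max_weight * min(max(trust, 0.0), 1.0)
    return 0.0


def blend(surrogate_mean, llm_mean, weight):
    """Convex combination of the surrogate's mean and the LLM's suggestion."""
    return (1.0 - weight) * surrogate_mean + weight * llm_mean
```

With a weight of zero the optimizer falls back to the plain surrogate, which is the property that keeps a miscalibrated prior from degrading the search.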
Findings across molecular benchmarks (ESOL, FreeSolv, Lipophilicity) show that the usefulness of LLM confidence depends on the objective, contradicting simplistic trust assumptions. On ESOL, higher confidence correlated with greater prediction error; on FreeSolv, using confidence gave modest benefits; and on Lipophilicity, results were best when confidence was ignored entirely. This heterogeneity underscores why objective-specific calibration matters.
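A simple diagnostic for this kind of mismatch is the rank correlation between self-reported confidence and absolute prediction error, computed separately per objective. The sketch below uses a hand-rolled Spearman correlation so it needs only the standard library; the function names and the toy numbers are illustrative assumptions, not data from the benchmarks. A positive value means higher confidence goes with larger error, which is the pattern described for ESOL.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def confidence_error_correlation(confidences, predictions, targets):
    """Spearman correlation between confidence and absolute error."""
    abs_err = [abs(p - t) for p, t in zip(predictions, targets)]
    return pearson(average_ranks(confidences), average_ranks(abs_err))


# Toy example: confidence rises while error also rises (made-up values).
rho = confidence_error_correlation(
    confidences=[0.1, 0.4, 0.7, 0.9],
    predictions=[1.0, 1.0, 1.0, 1.0],
    targets=[1.1, 1.3, 1.6, 2.0],
)
```

Running the same check per objective, rather than pooled over all of them, is what exposes cases where confidence helps on one target and hurts on another.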
For AI practitioners, the results indicate that integrating LLMs into optimization pipelines requires explicit gating mechanisms rather than naive incorporation. The negative result on margin-based portfolio selection suggests that acquisition-function awareness matters more than single-step error prediction, which is useful guidance for similar applications. Future development should focus on automated confidence-calibration methods and on extending these mechanisms to continuous optimization domains.
- LLM confidence levels are not reliably calibrated to optimization outcomes and require objective-specific gating mechanisms
- Dynamic reputation-market mechanisms outperform fixed trust weightings for multi-objective Bayesian optimization with LLM priors
- LLM utility varies significantly across different optimization objectives, necessitating independent calibration rather than global trust scores
- Raw LLM confidence can actively harm optimization performance on some benchmarks while modestly helping on others
- Acquisition-function-aware selection strategies outperform simpler margin-based approaches for integrating LLM guidance