When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies
Researchers demonstrate that large language models can extract predictive features from financial news with valid intermediate signals (Information Coefficient >0.15), yet these features fail to improve reinforcement learning trading agents during macroeconomic shocks. The findings reveal a critical gap between feature-level validity and downstream policy robustness, suggesting that valid signals alone cannot guarantee trading performance under distribution shifts.
This research addresses a fundamental problem in machine learning-driven finance: the assumption that valid predictive signals automatically improve decision-making systems. The team constructed a pipeline in which frozen LLMs extract structured feature vectors from unstructured financial text, feeding them into PPO-based trading agents. Through automated prompt optimization targeting the Information Coefficient (IC, the rank correlation between a signal and subsequent realized returns) rather than traditional NLP metrics, they discovered genuinely predictive features that correlate with realized returns on held-out data.
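The IC validation step can be sketched as follows. This is a minimal illustration of the standard Spearman-style IC, not the paper's code; the function name and the assumption of tie-free continuous inputs are mine.

```python
import numpy as np

def information_coefficient(features: np.ndarray, forward_returns: np.ndarray) -> float:
    """Spearman rank correlation between a feature series and realized
    forward returns -- a standard definition of the Information Coefficient.
    Assumes continuous (tie-free) inputs; ties would need average ranks."""
    def ranks(x: np.ndarray) -> np.ndarray:
        r = np.empty(len(x))
        r[np.argsort(x)] = np.arange(len(x))
        return r
    # Pearson correlation of the ranks = Spearman correlation of the values.
    return float(np.corrcoef(ranks(features), ranks(forward_returns))[0, 1])

# Synthetic check: a feature with a real (noisy) link to returns clears
# the IC > 0.15 bar reported in the study; an unrelated one does not.
rng = np.random.default_rng(0)
signal = rng.normal(size=500)
returns = 0.3 * signal + rng.normal(size=500)  # feature carries information
ic = information_coefficient(signal, returns)
```

Because IC is rank-based, it rewards monotone predictiveness rather than linear fit, which is why it is a more honest target for prompt optimization than, say, classification accuracy on sentiment labels.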
The critical finding emerges during stress testing. When macroeconomic conditions shift, as during financial crises, the LLM-derived features become noisy and actively degrade agent performance relative to price-only baselines. This failure mode parallels known transfer-learning breakdowns under distribution shift, where models trained on stable regimes collapse in new conditions.
For the AI-finance community, this exposes a dangerous assumption in feature engineering pipelines. Practitioners often treat feature validation as a discrete step, assuming validated features will improve downstream tasks. This work demonstrates the necessity of end-to-end stress testing across market regimes. Macroeconomic state variables consistently outperform LLM features as policy drivers, suggesting that capturing regime shifts matters more than textual feature richness.
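A regime-split evaluation of the kind the authors advocate can be sketched as below. The helper names, the zero risk-free rate, and the boolean stress mask (e.g. derived from a macro volatility index) are illustrative assumptions, not details from the paper.

```python
import numpy as np

def sharpe(returns: np.ndarray, periods_per_year: int = 252) -> float:
    """Annualized Sharpe ratio, risk-free rate assumed zero."""
    return float(np.sqrt(periods_per_year) * returns.mean() / returns.std())

def regime_split_report(agent_returns, baseline_returns, stress_mask):
    """Score an LLM-augmented agent against a price-only baseline
    separately on calm and stress subsamples, rather than on the
    full sample where calm periods can mask stress-period failures."""
    report = {}
    for regime, mask in [("calm", ~stress_mask), ("stress", stress_mask)]:
        report[regime] = {
            "agent": sharpe(agent_returns[mask]),
            "baseline": sharpe(baseline_returns[mask]),
        }
    return report

# Synthetic series reproducing the paper's qualitative pattern:
# the agent edges out the baseline in calm periods but degrades in shocks.
rng = np.random.default_rng(1)
n = 400
stress = np.zeros(n, dtype=bool)
stress[300:] = True                         # last 100 days are a shock regime
baseline = rng.normal(0.0005, 0.01, n)
agent = baseline.copy()
agent[~stress] += 0.0005                    # LLM features help in calm markets
agent[stress] -= 0.002                      # and hurt during the shock
report = regime_split_report(agent, baseline, stress)
```

A single full-sample Sharpe ratio would average these two regimes together, which is exactly the aggregation that hides the failure the paper documents.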
The implications extend beyond trading systems. Any financial AI application relying on LLM-extracted features must validate performance across multiple market regimes, not just in-sample or calm-period tests. The research suggests future work should focus on regime-aware feature extraction or adaptive feature weighting mechanisms that dynamically adjust to macroeconomic conditions rather than static LLM outputs.
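One simple form such an adaptive weighting mechanism could take is a volatility-conditioned gate on the LLM features. This is a sketch of the general idea, not the paper's proposal; the sigmoid form, threshold, and temperature are illustrative choices.

```python
import numpy as np

def gated_features(llm_features: np.ndarray,
                   macro_vol: np.ndarray,
                   vol_threshold: float = 0.25,
                   temperature: float = 0.05) -> np.ndarray:
    """Regime-aware gating sketch: smoothly down-weight LLM-derived
    features as a macro volatility measure rises, so the policy falls
    back on price/macro inputs during shocks.

    llm_features: (T, d) feature matrix; macro_vol: (T,) volatility series.
    """
    # Sigmoid gate: ~1 well below the threshold, 0.5 at it, ~0 far above.
    gate = 1.0 / (1.0 + np.exp((macro_vol - vol_threshold) / temperature))
    return llm_features * gate[:, None]  # broadcast gate over feature dim

# Calm, borderline, and shock observations for the same unit features.
feats = np.ones((3, 4))
vol = np.array([0.10, 0.25, 0.60])
gated = gated_features(feats, vol)
```

The appeal of a gate like this is that it leaves the feature extractor frozen, as in the paper's pipeline, and pushes regime awareness into a cheap, inspectable scalar rather than into the LLM itself.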
- LLM-extracted features showed valid predictive signals (IC >0.15) but failed to improve trading performance during macroeconomic shocks.
- Distribution shifts caused by market regime changes transform valid features into noise, demonstrating a critical gap between feature-level and policy-level robustness.
- Price-only baselines outperformed LLM-augmented agents during stress periods, suggesting textual analysis alone cannot capture macroeconomic risks.
- Prompt optimization targeting Information Coefficient successfully discovered predictive features, validating the automated tuning methodology.
- Macroeconomic state variables remained the most robust driver of trading policy improvements across all tested regimes.