Leakage-Aware Benchmarking of LLM Forecasting: Real-Time Nowcasts as the Decision-Time Input for Macro Factor Ranking
Researchers benchmark a retrieval-augmented LLM system for equity factor ranking using strictly decision-time information, avoiding the data leakage common in forecasting benchmarks. The 7B model achieves a modest positive result (median information coefficient, or IC, of +0.154) that is comparable to a simpler kNN baseline, suggesting that real-time macro data and historical analogies drive most of the signal while LLMs may add marginal value in extreme rankings.
This research addresses a critical methodological flaw in LLM forecasting benchmarks: the widespread practice of training or evaluating models on features that would not have been available at the actual decision time, which artificially inflates apparent performance. The authors construct an experimental framework spanning three years (April 2023 to March 2026) in which their LLM system observes only information actually available at month-end, when the equity factor ranking decision is made. The pipeline combines macro-analog retrieval with critic and actor LLMs to score seven U.S. equity style factors, achieving a median monthly Spearman rank correlation of +0.154 that is consistent across multiple 12-month windows. The result is not statistically significant, however, because the confidence interval includes zero.
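To make the headline metric concrete, here is a minimal sketch of how a monthly Spearman IC and a bootstrap interval for its median could be computed. The function names, the percentile bootstrap, and the resample count are illustrative assumptions, not the authors' procedure.

```python
import random
import statistics

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman_ic(scores, realized):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = ranks(scores), ranks(realized)
    mx, my = statistics.fmean(rx), statistics.fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # assumption: a constant input is treated as zero IC
    return cov / (vx * vy) ** 0.5

def bootstrap_median_ci(ics, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the median of the monthly ICs."""
    rng = random.Random(seed)
    meds = sorted(
        statistics.median(rng.choices(ics, k=len(ics))) for _ in range(n_boot)
    )
    lo = meds[int((alpha / 2) * n_boot)]
    hi = meds[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

In this setup, `spearman_ic` would be called once per month on the seven factor scores and the seven realized factor returns, and the resulting list of monthly ICs passed to `bootstrap_median_ci`. An interval that straddles zero corresponds to the caveat above.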
The study finds that much of the LLM system's predictive power derives not from sophisticated language understanding but from simpler components: lagged macro variables, recent event summaries, and Cleveland Fed inflation nowcasts. A non-LLM kNN baseline subject to the same decision-time constraint recovers comparable median performance, suggesting that selecting macro-similar historical states explains the bulk of the signal. Where the LLM shows a potential advantage is in the extreme rank scores that matter for long-short portfolio construction, indicating that language models may extract nuance at the tails of the ranking rather than across the bulk of predictions.
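The kNN baseline and the role of extreme ranks can be sketched as follows. This is a hypothetical illustration under stated assumptions (Euclidean distance on the macro state, a simple average of neighbor outcomes, and invented function names), not the authors' implementation.

```python
import math

def knn_analog_scores(history, query_state, decision_index, k=3):
    """
    Score factors by averaging outcomes of the k most similar past months.

    history: list of (macro_state, next_month_factor_returns), ordered by month.
    Only months with index < decision_index are eligible, so every outcome
    used was already realized when the decision is made.
    """
    eligible = history[:decision_index]
    if not eligible:
        raise ValueError("no eligible history before the decision month")
    nearest = sorted(eligible, key=lambda rec: math.dist(rec[0], query_state))[:k]
    n_factors = len(nearest[0][1])
    return [
        sum(rec[1][f] for rec in nearest) / len(nearest) for f in range(n_factors)
    ]

def long_short_legs(scores, names, n=2):
    """Return the top-n names (long leg) and bottom-n names (short leg)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [names[i] for i in order[:n]], [names[i] for i in order[-n:]]
```

Slicing `history` at `decision_index` is the leakage guard: a later month cannot become a neighbor even when its macro state is the closest match. Because `long_short_legs` uses only the ends of the ranking, a model can matter for portfolio formation even when its median IC matches a baseline.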
For AI practitioners and quant investors, this study supports caution about inflated benchmark results while showing that the value of LLMs in forecasting remains context-dependent and marginal. The work offers a leakage-aware experimental methodology that future LLM forecasting research can build on. Practitioners should be skeptical of forecasting claims that lack explicit decision-time constraints and should recognize that simpler baselines often capture most of the achievable signal in macro prediction tasks.
- LLM forecasting benchmarks commonly suffer from data leakage; this study enforces strict decision-time information constraints to measure true capability.
- A 7B LLM system achieved +0.154 median monthly Spearman IC on equity factor ranking, statistically underpowered but consistent across subperiods.
- Non-LLM kNN baseline recovered comparable median performance, suggesting real-time macro data and historical similarity drive most predictive signal.
- LLMs showed marginal advantage concentrated in extreme rankings used for long-short portfolio formation rather than bulk predictions.
- The research establishes methodological best practices for leakage-aware LLM forecasting that should inform future benchmarking in financial prediction.