PortBench: A Correlation-Aware, Full-Pipeline Benchmark for LLM-Driven Portfolio Management
Researchers introduce PortBench, a comprehensive benchmark for evaluating large language models on portfolio management tasks. The study finds that 90% of tested model-profile combinations fail to outperform a basic equal-weight allocation, highlighting a significant gap between LLM performance on financial QA tasks and real-world portfolio decision-making.
PortBench addresses a critical blind spot in LLM evaluation by introducing the first correlation-aware benchmark for portfolio management. The research distinguishes itself by measuring not just isolated financial knowledge but the ability to construct genuinely diversified portfolios that exploit inter-asset hedging opportunities. This matters because existing benchmarks fail to penalize concentrated portfolios or account for how asset correlations shift during market stress.
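To make the diversification point concrete, here is a minimal sketch of how correlation changes portfolio risk. It is not taken from PortBench: the function name `portfolio_volatility`, the two-asset setup, and every number are assumptions chosen for illustration.

```python
import math

def portfolio_volatility(weights, vols, corr):
    """Portfolio volatility from weights, per-asset volatilities, and a correlation matrix."""
    n = len(weights)
    variance = 0.0
    for i in range(n):
        for j in range(n):
            # Covariance between assets i and j is vol_i * vol_j * corr_ij.
            variance += weights[i] * weights[j] * vols[i] * vols[j] * corr[i][j]
    return math.sqrt(variance)

# Hypothetical inputs (illustrative only, not PortBench data).
vols = [0.20, 0.10]
calm_corr = [[1.0, -0.3], [-0.3, 1.0]]    # assets partly hedge each other
stress_corr = [[1.0, 0.8], [0.8, 1.0]]    # correlation rises under stress

concentrated = [1.0, 0.0]
balanced = [0.5, 0.5]

vol_concentrated = portfolio_volatility(concentrated, vols, calm_corr)
vol_calm = portfolio_volatility(balanced, vols, calm_corr)
vol_stress = portfolio_volatility(balanced, vols, stress_corr)

print(round(vol_concentrated, 4), round(vol_calm, 4), round(vol_stress, 4))
```

In this toy setup the balanced portfolio is much less volatile than the concentrated one while the assets hedge each other, and loses part of that benefit once the correlation rises. A benchmark that ignores correlations cannot reward or penalize that difference.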
The benchmark's dual-layer architecture mirrors real-world portfolio management, combining static correlation-based questions with a dynamic five-stage allocation pipeline. This comprehensive approach exposes a fundamental limitation: LLMs excel at answering individual financial questions but systematically fail at sequential decision-making under portfolio constraints. The introduction of CEPS (Compound Error Pipeline Score) quantifies how reasoning errors compound across multiple decision stages, revealing that procedural compliance doesn't prevent catastrophic performance during market stress.
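The summary does not give the CEPS formula, so the following is not CEPS. It is only a rough sketch of why errors compound in a sequential pipeline, assuming (hypothetically) that each stage succeeds independently with a fixed probability. The function name and the 0.9 rate are illustrative.

```python
def end_to_end_success(stage_success_rates):
    """Probability that every stage succeeds, assuming independent stages."""
    result = 1.0
    for p in stage_success_rates:
        result *= p
    return result

# Hypothetical: five stages, each correct 90% of the time.
five_stage = end_to_end_success([0.9] * 5)
print(round(five_stage, 3))
```

Under these assumptions, a model that looks strong at every individual stage still completes the full five-stage pipeline without error only about 59% of the time.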
The finding that 90% of model-profile combinations underperform equal-weight allocation suggests LLMs may introduce complexity without improving outcomes. Critically, models that satisfy every stated constraint still experience severe drawdowns under historical stress regimes, indicating LLMs struggle with real-world risk management scenarios that simple rule-based strategies handle adequately.
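For readers unfamiliar with the baseline and the risk measure mentioned here, the sketch below shows an equal-weight allocation and a maximum-drawdown calculation. The helper names, the per-period rebalancing to fixed weights, and the return series are all assumptions for illustration. PortBench's own baseline and stress regimes may be defined differently.

```python
def equal_weights(n_assets):
    """Equal-weight allocation: every asset gets 1/n of the portfolio."""
    return [1.0 / n_assets] * n_assets

def portfolio_values(weights, period_returns, start=1.0):
    """Portfolio value path, assuming rebalancing to the given weights each period."""
    values = [start]
    for returns in period_returns:
        r = sum(w * x for w, x in zip(weights, returns))
        values.append(values[-1] * (1.0 + r))
    return values

def max_drawdown(values):
    """Largest peak-to-trough loss, as a fraction of the running peak."""
    peak = values[0]
    worst = 0.0
    for v in values:
        peak = max(peak, v)
        worst = max(worst, (peak - v) / peak)
    return worst

# Hypothetical returns for two assets over three periods (not PortBench data).
period_returns = [[0.10, 0.00], [-0.40, 0.00], [0.10, 0.00]]

dd_concentrated = max_drawdown(portfolio_values([1.0, 0.0], period_returns))
dd_equal = max_drawdown(portfolio_values(equal_weights(2), period_returns))

print(round(dd_concentrated, 3), round(dd_equal, 3))
```

In this toy series the concentrated portfolio draws down about 40% while the equal-weight portfolio draws down about 20%, which illustrates how a simple rule can limit losses that a concentrated allocation absorbs in full.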
For the intersection of AI and finance, PortBench establishes evaluation standards that move beyond isolated capability assessment. This benchmarking approach will likely influence how financial institutions evaluate LLM reliability for decision-support systems. The results suggest that deploying LLMs for portfolio management requires substantial safeguards and human oversight rather than reliance on autonomous decision-making.
- PortBench introduces the first correlation-aware benchmark specifically designed for evaluating LLM portfolio management capabilities across six asset classes.
- 90% of tested model-profile combinations fail to outperform a basic equal-weight portfolio allocation, despite strong LLM performance on isolated financial questions.
- A new CEPS metric quantifies how reasoning errors compound across sequential portfolio management stages, revealing systematic decision pipeline failures.
- LLMs that satisfy all procedural constraints still suffer catastrophic drawdowns under historical market stress conditions, exposing risk management blind spots.
- The benchmark establishes evaluation standards that test complete decision pipelines rather than isolated financial knowledge, setting new expectations for AI-finance applications.