From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets
Researchers introduce KTD-Fin, a benchmark that addresses two flaws in how LLM trading agents are evaluated: it masks market identifiers to prevent memorization, and it uses attribution analysis to isolate genuine alpha. Tests on 10 frontier LLM agents show that their trading returns stem primarily from passive market and style exposure rather than transferable investment skill.
The research exposes a fundamental methodological problem in evaluating LLM trading capabilities: frontier models such as GPT-4 have knowledge cutoffs that fall within or after the historical periods used for backtesting, so agents can produce convincing investment rationales from memorized market data rather than genuine reasoning. The result is an illusion of competence in which agents appear profitable but lack true analytical insight. KTD-Fin addresses this by anonymizing ticker symbols, dates, and price information during evaluation, forcing agents to rely on underlying financial principles instead of recalling known outcomes.
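To make the masking idea concrete, here is a minimal Python sketch. It is not KTD-Fin's actual procedure, which this summary does not specify. The function name `mask_records`, the `ASSET_001`-style aliases, the trading-day index, and the rebasing of prices to 100 are all illustrative assumptions.

```python
def mask_records(records):
    """Hide tickers, calendar dates, and absolute price levels.

    Hypothetical masking scheme (an assumption, not the benchmark's own):
      - tickers become ASSET_001, ASSET_002, ... in order of first appearance
      - dates become a 0-based trading-day index
      - prices are rebased so each asset's first observed close equals 100

    records: list of dicts with keys 'ticker', 'date' (ISO string), 'close'.
    Returns (masked_records, alias_map).
    """
    # ISO date strings sort chronologically, so a plain sort is enough.
    trading_days = sorted({r["date"] for r in records})
    day_index = {d: i for i, d in enumerate(trading_days)}

    alias = {}        # real ticker -> anonymous label
    first_close = {}  # real ticker -> first observed close, used for rebasing
    masked = []
    for r in sorted(records, key=lambda r: (r["date"], r["ticker"])):
        ticker = r["ticker"]
        if ticker not in alias:
            alias[ticker] = f"ASSET_{len(alias) + 1:03d}"
            first_close[ticker] = r["close"]
        masked.append({
            "asset": alias[ticker],
            "day": day_index[r["date"]],
            "price_index": round(100.0 * r["close"] / first_close[ticker], 4),
        })
    return masked, alias


# Usage with made-up tickers and prices.
sample = [
    {"ticker": "AAA", "date": "2024-01-02", "close": 50.0},
    {"ticker": "BBB", "date": "2024-01-02", "close": 20.0},
    {"ticker": "AAA", "date": "2024-01-03", "close": 55.0},
    {"ticker": "BBB", "date": "2024-01-03", "close": 19.0},
]
masked, alias = mask_records(sample)
for row in masked:
    print(row)
```

The agent would see only the masked rows, while the evaluator keeps `alias` private so that trades can be mapped back to real instruments for scoring. A production scheme would likely need more care than this. For example, assigning aliases in alphabetical order can itself leak information.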
The attribution framework is the second critical innovation: it decomposes returns into market beta, style factors, and alpha. The distinction matters because positive portfolio performance can result from passive exposure to broad market gains or sector preferences rather than superior stock selection. The benchmark's findings on China's CSI 300 show that current LLM agents struggle to generate meaningful alpha when memory leakage is prevented, suggesting that their apparent trading prowess reflects memorization artifacts and passive exposure rather than transferable investment skill.
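A return decomposition of this kind is commonly estimated with a linear factor regression, r_p = alpha + beta_mkt * r_mkt + beta_style * f_style + noise. The sketch below fits that model by ordinary least squares in plain Python. It assumes a single market factor and a single style factor. The summary does not say which factor model KTD-Fin uses, so the function names and the two-factor setup are illustrative only.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def mean(values):
    return sum(values) / len(values)


def attribute_returns(portfolio, market, style):
    """Split average per-period return into alpha, market, and style parts.

    Assumed model (illustrative, not necessarily the benchmark's):
        portfolio_t = alpha + beta_market * market_t + beta_style * style_t + e_t
    All three inputs are equal-length lists of per-period returns.
    """
    rows = [[1.0, mk, st] for mk, st in zip(market, style)]
    k = 3
    # Normal equations: (X^T X) coef = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, portfolio)) for i in range(k)]
    alpha, beta_market, beta_style = solve_linear(xtx, xty)
    return {
        "alpha": alpha,
        "beta_market": beta_market,
        "beta_style": beta_style,
        "market_contribution": beta_market * mean(market),
        "style_contribution": beta_style * mean(style),
    }


# Usage with made-up returns: this agent simply holds a levered market position.
market = [0.010, -0.020, 0.015, 0.003, -0.007, 0.020]
style = [0.002, 0.004, -0.003, 0.001, -0.005, 0.006]
agent = [1.2 * mk + 0.5 * st for mk, st in zip(market, style)]
print(attribute_returns(agent, market, style))
```

Because the regression includes an intercept, the average portfolio return equals alpha plus the market and style contributions. A profitable agent whose alpha is near zero has therefore earned its return from passive exposure, which is the pattern the benchmark reports.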
For the AI and fintech industries, these findings impose necessary discipline on claims about LLM trading capabilities. They show that benchmarking methodologies must control for knowledge contamination and measure the sources of returns, not merely the outcomes. The work sets evaluation standards that distinguish genuine financial reasoning from superficial pattern recognition, which matters both for developers building financial AI systems and for institutions considering LLM deployment in investment workflows. KTD-Fin's reproducible template provides a foundation for more rigorous future assessments.
- LLM agents' apparent trading profitability largely results from memorized market data rather than genuine investment reasoning.
- Anonymizing market identifiers and dates forces agents toward legitimate financial analysis, substantially changing their decision rationales.
- Attribution analysis reveals most LLM trading returns come from passive market and style exposure, not alpha generation.
- Current frontier LLMs show limited evidence of persistent stock-selection skill when memory leakage is controlled.
- KTD-Fin establishes reproducible evaluation standards for assessing transferable financial skill in AI trading agents.