NumLeak: Public Numeric Benchmarks as Latent Labels in Foundation Models
Researchers introduce NumLeak, a framework revealing that frontier large language models memorize public numeric benchmarks from pretraining data rather than genuinely understanding underlying concepts. The study demonstrates that models achieve near-perfect recall on financial and economic metrics when prompted with dates, but this performance collapses on recent holdout data, indicating memorization rather than reasoning capability.
The NumLeak framework exposes a weakness in how frontier LLMs are evaluated: apparent mastery of quantitative domains may reflect memorized data rather than analytical capability. Testing models on public benchmarks such as Fama-French factors, unemployment data, and inflation metrics, the researchers found that top-tier systems reached correlation coefficients of 0.97-0.99 against historical values, while on recent, unseen periods parse rates fell to 21-57%. This asymmetry between memorized and novel data points to a gap between perceived and actual model competence.
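The two metrics quoted above can be illustrated with a minimal sketch. It assumes that "parse rate" means the share of responses from which a numeric answer can be extracted, and that the correlation is a Pearson r computed over the parsed responses only; the source does not spell out either definition. The function names and the naive first-number regex are hypothetical, not the paper's code.

```python
import math
import re

# Naive extraction: take the first number in the response (an assumption;
# a real harness would need to skip dates and other stray numbers).
NUMBER_RE = re.compile(r"[-+]?\d+(?:\.\d+)?")


def parse_number(text):
    """Return the first number found in text, or None if there is none."""
    match = NUMBER_RE.search(text.replace(",", ""))
    return float(match.group()) if match else None


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if degenerate)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def score_recall(responses, truths):
    """Return (parse_rate, r) for model responses against published values."""
    parsed = [(parse_number(resp), truth) for resp, truth in zip(responses, truths)]
    kept = [(p, t) for p, t in parsed if p is not None]
    parse_rate = len(kept) / len(parsed) if parsed else 0.0
    if len(kept) < 2:
        return parse_rate, float("nan")
    r = pearson([p for p, _ in kept], [t for _, t in kept])
    return parse_rate, r


# Made-up responses to date-keyed prompts, purely for illustration.
responses = ["3.5", "4.1%", "I cannot provide that.", "5.0"]
truths = [3.5, 4.0, 4.5, 5.0]
parse_rate, r = score_recall(responses, truths)
print(parse_rate, r)
```

Reporting both numbers matters because a refusal lowers the parse rate without affecting r, so a high correlation on a small parsed subset can coexist with a model that mostly declines to answer.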
The research combines black-box API probing with white-box experiments to validate its findings across different model architectures. When the analysis controls for the model's own recalled values, the correlation attributed to market-sentiment understanding collapses from r=0.74 to r=0.02, indicating that the apparent understanding was carried almost entirely by the memorized component. These findings matter for financial services, where institutions increasingly deploy LLMs for analysis and forecasting. Organizations cannot rely on model outputs in quantitative domains without knowing whether a response reflects reasoning or regurgitated training data.
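The "controlling for recalled values" step can be sketched as a simple residualization: regress the model's signal on its own recalled value, then correlate what is left with the true outcome. This is a minimal sketch on synthetic data, assuming a single-regressor least-squares fit; the paper's exact procedure is not given in the source, and all names and numbers here are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def residualize(ys, xs):
    """Residuals of an ordinary least-squares fit of ys on xs (with intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Synthetic setup (an assumption for illustration): the model's signal is
# its recalled value plus noise that is unrelated to the truth.
rng = random.Random(0)
truth = [rng.gauss(0, 1) for _ in range(500)]
recalled = [t + 0.1 * rng.gauss(0, 1) for t in truth]
signal = [rec + 0.5 * rng.gauss(0, 1) for rec in recalled]

raw_r = pearson(signal, truth)
resid_r = pearson(residualize(signal, recalled), truth)
print(f"raw r = {raw_r:.2f}, residualized r = {resid_r:.2f}")
```

In this toy setup the raw correlation is high only because the signal inherits the recalled value; once that component is regressed out, the remainder carries essentially no information about the truth, which is the same qualitative pattern the study reports.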
The identified defense, a single-line system prompt that blocks 99.8% of memorization attacks at minimal utility cost, offers a practical mitigation. The broader implication, however, is that current evaluation methodologies systematically overstate LLM capabilities in domains where public benchmarks exist in training data. Financial institutions, AI developers, and researchers should recalibrate their confidence in model-based analysis and adopt stronger validation protocols that explicitly test generalization beyond the training distribution.
- Frontier LLMs achieve 0.97-0.99 correlation on historical financial metrics, but parse rates collapse to 21-57% on recent unseen data, indicating memorization rather than reasoning
- Residualizing out model-recalled values reduces apparent market-sentiment understanding from r=0.74 to r=0.02, revealing that genuine analytical capability is minimal
- White-box experiments and logprob ranking detect memorization that open-ended generation masks, suggesting that API-based evaluations understate the memorization channel
- A single-line system-prompt defense blocks 99.8% of memorization attacks with near-zero utility cost on legitimate queries
- Current LLM evaluation methodologies systematically overstate capabilities in domains where public benchmarks appear in training data
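The logprob-ranking idea mentioned in the takeaways can be sketched as follows. Instead of asking the model to generate a number, the evaluator scores a fixed set of candidate values and checks where the true value ranks. This is a hedged sketch: the candidate log-probabilities are assumed to come from some scoring interface that is not shown here, the function names are hypothetical, and the numbers are made up.

```python
def rank_of_truth(candidate_logprobs, truth):
    """1-based rank of the true value among candidates, by descending logprob."""
    target = candidate_logprobs[truth]
    return 1 + sum(1 for lp in candidate_logprobs.values() if lp > target)


def top_k_hit_rate(items, k=1):
    """Share of items whose true value ranks within the top k candidates."""
    hits = sum(1 for logprobs, truth in items if rank_of_truth(logprobs, truth) <= k)
    return hits / len(items)


# Each item: (candidate value -> model logprob, true published value).
items = [
    ({"3.5": -0.2, "3.6": -2.0, "3.4": -3.1}, "3.5"),
    ({"4.0": -1.5, "4.1": -0.4}, "4.0"),
]
print(top_k_hit_rate(items, k=1), top_k_hit_rate(items, k=2))
```

A model can decline to answer in open-ended generation and still rank the true value first among candidates, which is one way such a probe could surface memorization that generation-based tests miss.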