MacroLens: A Multi-Task Benchmark for Contextual Financial Reasoning under Macroeconomic Scenarios
MacroLens is a new financial reasoning benchmark that combines price history, accounting fundamentals, macroeconomic data, and news text to evaluate AI models on seven financial tasks across 4,416 U.S. small- and micro-cap stocks. It addresses four evaluation challenges specific to finance (publication-date gating, reporting lags, text-data redundancy, and regime leakage) and tests 19 methods ranging from heuristics to frontier LLMs, providing a standardized tool for developing contextual financial AI systems.
MacroLens targets a gap in AI benchmarking: it is presented as the first comprehensive benchmark to jointly evaluate models on multimodal financial reasoning. Conventional time-series benchmarks transfer poorly to finance because they ignore four constraints: news must be gated by publication date to avoid look-ahead bias, accounting statements are published months after the period they describe, filing text overlaps with the numerical data it reports, and macroeconomic regimes do not align with calendar-based splits. The benchmark covers 2021-2026 across 4,416 equities, integrating 46.8M accounting facts, 53 macro series, 296K SEC filings, and 216K news articles, plus 1,130 automatically detected macroeconomic events rendered as natural-language scenarios.
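The first two constraints, publication-date gating and reporting lags, both come down to filtering on the date an item became public rather than the period it describes. The snippet below is a minimal sketch of that idea, not MacroLens's actual pipeline; the `Record` type, its field names, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    """Hypothetical point-in-time record; field names are assumptions."""
    kind: str            # e.g. "news" or "fundamental"
    available_on: date   # date the item became public (publication or filing date)
    period_end: date     # date or period the item describes
    payload: str


def visible_as_of(records, as_of):
    """Keep only records that were already public on `as_of`."""
    return [r for r in records if r.available_on <= as_of]


# Made-up rows for illustration only.
records = [
    # Quarter ended 31 March, but the statement was filed on 10 May.
    Record("fundamental", date(2023, 5, 10), date(2023, 3, 31), "Q1 revenue"),
    Record("news", date(2023, 4, 3), date(2023, 4, 3), "guidance update"),
    Record("news", date(2023, 4, 20), date(2023, 4, 20), "analyst note"),
]

# A forecast made on 15 April may only see what was public by then.
context = visible_as_of(records, date(2023, 4, 15))
visible_payloads = [r.payload for r in context]
```

Filtering on `period_end` instead would let the Q1 statement into a mid-April context, even though it was not filed until May. That is the look-ahead leak the gating rule is meant to prevent.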
The benchmark's scope reflects the complexity of real-world financial decision-making. Rather than isolating signals, MacroLens forces models to synthesize heterogeneous data sources under realistic temporal constraints, a requirement absent from existing benchmarks. This matters because institutional investors and fintech platforms already operate this way; standardizing evaluation helps identify which AI approaches actually generalize to production environments.
For the AI industry, MacroLens establishes infrastructure for training and evaluating financial reasoning models at scale. The evaluation of 19 methods across six families, from gradient boosting to zero-shot LLMs, provides empirical grounding for comparing approaches. The five-step feature-context ablation on frontier LLMs indicates which signal combinations drive performance, which can guide future model architecture decisions.
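One common way to run a feature-context ablation is cumulative: each step adds one more block of context to the prompt, and the score is compared with the previous step. The sketch below illustrates that pattern only. The summary does not list the five steps, so the block names, their order, and the `build_prompt` helper are assumptions.

```python
# Hypothetical context blocks; the actual five ablation steps are not
# specified in the summary, so these names and this order are assumptions.
BLOCKS = ["price_history", "fundamentals", "macro_series", "news", "macro_scenario"]


def cumulative_contexts(blocks):
    """Return growing prefixes of `blocks`, one list per ablation step."""
    return [blocks[: i + 1] for i in range(len(blocks))]


def build_prompt(features, included):
    """Concatenate only the included blocks into a single prompt string."""
    return "\n".join(f"[{name}] {features[name]}" for name in included)


# Placeholder feature text, for illustration only.
features = {name: f"<{name} text>" for name in BLOCKS}

steps = cumulative_contexts(BLOCKS)
prompts = [build_prompt(features, included) for included in steps]
```

Scoring the same model on each entry of `prompts` then attributes any change in performance to the block added at that step.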
Looking ahead, broader adoption of multimodal financial benchmarks will likely accelerate development of hybrid models combining time-series expertise with language understanding. MacroLens' public release on Hugging Face may inspire competing benchmarks addressing other asset classes, geographies, or prediction horizons, gradually shifting the field toward more realistic evaluation standards.
- MacroLens is the first benchmark jointly evaluating AI on price history, accounting fundamentals, macroeconomic data, and news for financial reasoning.
- The dataset covers 4,416 U.S. small- and micro-cap stocks from 2021-2026 with 46.8M accounting facts, 53 macro series, and 296K SEC filings.
- Evaluation of 19 methods across six families reveals performance gaps between naive heuristics, time-series models, fine-tuned LLMs, and zero-shot approaches.
- MacroLens addresses four time-series evaluation challenges unique to finance: publication date gating, reporting lags, text-data redundancy, and regime leakage.
- The benchmark's public release on Hugging Face establishes infrastructure for standardized financial AI evaluation and model development.