FAME: Forecastability-Aware Mixture of Experts for Heterogeneous Time Series Forecasting
Researchers introduce FAME, a sparse mixture-of-experts framework that dynamically routes each time series to specialized forecasting models based on its data characteristics. Tested on a production retail dataset covering 5,000+ vending machines, the system achieves a 12.4% MSE improvement over single-model baselines while activating an average of only 1.92 experts per series, demonstrating practical advantages for large-scale commercial forecasting systems.
FAME addresses a fundamental challenge in production forecasting systems: heterogeneous time series rarely respond well to single unified models. Traditional approaches either lock in one model across all data regimes or deploy dense ensembles that waste computational resources and obscure which models actually work best for different scenarios. This research bridges that gap by learning to recognize data patterns and match them to appropriate experts.
The core innovation lies in the "forecastability fingerprint"—a multidimensional representation capturing each series' lifecycle, volatility, seasonality, and spectral characteristics. Rather than treating expert selection as a static problem, FAME mines validation performance to identify expert-suitability patterns, then trains a sparse router that activates only a budgeted subset of experts per series. This transforms model selection from manual heuristics into a data-driven mining exercise.
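As a rough illustration of the idea, the sketch below computes a four-feature fingerprint for one series and passes it through a budgeted top-k gate. The feature definitions (time since first non-zero value, coefficient of variation, autocorrelation at a seasonal lag, normalized spectral entropy), the linear router, and every name and default (`fingerprint`, `route`, `period`, `budget`, `threshold`) are assumptions made for illustration, not the implementation described by the authors.

```python
import numpy as np

def fingerprint(series, period=7):
    """Hypothetical fingerprint: [lifecycle, volatility, seasonality, spectral entropy]."""
    x = np.asarray(series, dtype=float)
    n = x.size
    # Lifecycle: share of the window elapsed since the first non-zero observation.
    nonzero = np.flatnonzero(x)
    lifecycle = (n - nonzero[0]) / n if nonzero.size else 0.0
    # Volatility: coefficient of variation (std / |mean|).
    mean = x.mean()
    volatility = x.std() / abs(mean) if mean != 0 else 0.0
    # Seasonality: autocorrelation at lag `period`.
    xc = x - mean
    denom = np.dot(xc, xc)
    seasonality = (
        np.dot(xc[:-period], xc[period:]) / denom if denom > 0 and n > period else 0.0
    )
    # Spectral entropy: Shannon entropy of the periodogram, scaled to [0, 1].
    power = np.abs(np.fft.rfft(xc))[1:] ** 2
    total = power.sum()
    if total > 0 and power.size > 1:
        p = power / total
        p = p[p > 0]
        entropy = float(-(p * np.log(p)).sum() / np.log(power.size))
    else:
        entropy = 0.0
    return np.array([lifecycle, volatility, seasonality, entropy])

def route(fp, weights, bias, budget=2, threshold=0.1):
    """Sparse gate: softmax over expert scores, keeping at most `budget` experts."""
    logits = weights @ fp + bias
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:budget]            # highest-scoring experts first
    keep = [i for i in top if probs[i] >= threshold]  # drop weak secondary experts
    if not keep:
        keep = [top[0]]                               # always activate at least one expert
    gate = np.zeros_like(probs)
    gate[keep] = probs[keep]
    return gate / gate.sum()                          # renormalize over active experts
```

In a sketch like this, `weights` and `bias` would be fitted so that the gate favors the experts with the lowest validation error for series with similar fingerprints. The `threshold` is one simple way for some series to use fewer experts than the budget allows, which is how an average activation count below the budget could arise.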
The production deployment at Shandong New Beiyang provides meaningful validation beyond academic benchmarks. With over 60 million transactions across 5,000+ machines, the system operates at genuine scale. The 12.4% MSE reduction compared to LightGBM, achieved while averaging just 1.92 expert activations per series, indicates that the accuracy gain does not depend on running a dense ensemble. Activating fewer experts per series means less computation per forecast and lower prediction latency in replenishment pipelines.
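The two headline quantities can be computed along the following lines. This is a minimal sketch that assumes the reported 12.4% is a relative reduction in MSE against the baseline and that the activation count is the mean number of non-zero gate weights per series. The function names and the toy numbers in the usage note are hypothetical and are not taken from the study.

```python
import numpy as np

def relative_mse_reduction(y_true, y_base, y_model):
    """Fraction by which the model's MSE is lower than the baseline's MSE."""
    y_true = np.asarray(y_true, dtype=float)
    mse_base = np.mean((y_true - np.asarray(y_base, dtype=float)) ** 2)
    mse_model = np.mean((y_true - np.asarray(y_model, dtype=float)) ** 2)
    return float((mse_base - mse_model) / mse_base)

def mean_active_experts(gates):
    """Average number of experts with non-zero gate weight per series.

    `gates` is a (num_series, num_experts) array of routing weights.
    """
    return float(np.mean(np.count_nonzero(np.asarray(gates), axis=1)))
```

For example, with made-up values `y_true = [10, 12, 8, 11]`, baseline forecasts `[12, 10, 10, 9]` and model forecasts `[11, 11, 9, 10]`, the baseline MSE is 4 and the model MSE is 1, so the relative reduction is 0.75. A gate matrix whose four rows activate 2, 1, 2, and 2 experts gives a mean of 1.75 active experts.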
This work has broader implications for industrial machine learning. As enterprises accumulate diverse datasets with varying statistical properties, routing frameworks become essential infrastructure. The approach suggests forecasting systems should incorporate explicit forecastability assessment rather than applying monolithic models. Future applications likely extend beyond retail to demand planning, resource allocation, and any domain containing heterogeneous temporal data requiring cost-efficient inference.
- FAME achieves 12.4% MSE improvement over single-model baselines while using only 1.92 experts per series on production retail data
- Forecastability fingerprinting enables systematic data-driven expert routing instead of heuristic model selection
- Sparse mixture-of-experts reduces computational inference costs while improving forecast accuracy across heterogeneous time series
- Production deployment across 5,000+ vending machines validates practical advantages beyond academic benchmarks
- Framework transforms retail demand forecasting into a data mining problem of expert specialization patterns