An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers
Researchers empirically compared eight input encoder architectures for Transformer models processing multi-channel signal data, finding that the standard per-channel linear projection performs on par with the best alternatives while being the simplest to implement. Two encoders underperformed significantly, the shared-scalar baseline and the channel-independent architecture, while differences among the top performers remained small in practice.
This empirical study addresses a fundamental architectural decision in Transformer-based time series models: how to efficiently encode multiple simultaneous input channels into a unified representation. The research systematically evaluates eight encoder variants on synthetic benchmarks designed to test channel identity preservation and on the real-world ETTh1 dataset, measuring performance via next-step negative log-likelihood. The findings reveal a practical hierarchy in which most of the more sophisticated approaches converge to similar performance, with the simple per-channel linear projection (nn.Linear) competitive with the more complex alternatives.
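As a minimal sketch of the baseline described above, a per-channel linear projection can be written as a single nn.Linear applied to each time step's channel vector. The class name, the prediction head, the tensor sizes, and the Gaussian form of the likelihood are all assumptions made for illustration; the study's actual model and likelihood may differ, and the Transformer body is omitted.

```python
import math

import torch
import torch.nn as nn


class PerChannelLinearEncoder(nn.Module):
    """Embed each time step's channel vector with a single nn.Linear (illustrative name)."""

    def __init__(self, n_channels: int, d_model: int):
        super().__init__()
        # Column c of the weight matrix acts as the learned embedding direction of channel c.
        self.proj = nn.Linear(n_channels, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_channels) -> (batch, time, d_model)
        return self.proj(x)


def gaussian_nll(mean: torch.Tensor, log_var: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean negative log-likelihood under a diagonal Gaussian (assumed likelihood)."""
    per_element = log_var + (target - mean) ** 2 * torch.exp(-log_var) + math.log(2 * math.pi)
    return 0.5 * per_element.mean()


torch.manual_seed(0)
n_channels, d_model = 7, 32          # 7 channels, as in ETTh1; d_model is arbitrary
x = torch.randn(2, 16, n_channels)   # random stand-in for (batch, time, channels) data

encoder = PerChannelLinearEncoder(n_channels, d_model)
head = nn.Linear(d_model, 2 * n_channels)  # hypothetical head; Transformer layers omitted

tokens = encoder(x)                                  # (2, 16, 32)
mean, log_var = head(tokens).chunk(2, dim=-1)        # each (2, 16, 7)
# Next-step objective: the output at step t is scored against the input at step t + 1.
loss = gaussian_nll(mean[:, :-1], log_var[:, :-1], x[:, 1:])
```

In a full model, the Transformer layers would sit between `encoder` and `head`; only the encoder changes when comparing input encoding variants.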
The research contributes to growing efforts in machine learning to validate architectural choices empirically rather than rely on intuition. The shared-scalar baseline's failure is attributed to information-theoretic constraints, providing theoretical grounding for why certain approaches fundamentally cannot encode multiple channels effectively into single vectors. The channel-independent architecture's universal underperformance and overfitting suggest that treating channels as entirely independent loses valuable cross-channel dependencies in temporal signal processing.
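A toy case can illustrate why a shared embedding cannot preserve channel identity. The sketch below assumes one plausible reading of a shared-scalar encoder, in which every channel value scales the same embedding vector and the results are summed; the study's exact definition may differ, and the function names and numbers are hypothetical.

```python
def shared_scalar_encode(x, w):
    """Every channel scales the same vector w, so only the channel sum survives."""
    total = sum(x)
    return [total * w_j for w_j in w]


def per_channel_encode(x, W):
    """Channel c scales its own vector W[c], so channel identity is retained."""
    d_model = len(W[0])
    return [sum(x[c] * W[c][j] for c in range(len(x))) for j in range(d_model)]


# Two different readings of three channels that happen to share the same sum.
a = [1.0, 2.0, 3.0]
b = [3.0, 2.0, 1.0]

w = [0.5, -1.0, 2.0, 0.25]            # shared embedding vector (illustrative values)
W = [[1.0, 0.0, 0.0, 0.0],            # one embedding vector per channel
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]

shared_a, shared_b = shared_scalar_encode(a, w), shared_scalar_encode(b, w)
per_a, per_b = per_channel_encode(a, W), per_channel_encode(b, W)

print(shared_a == shared_b)  # the two inputs collapse to one representation
print(per_a == per_b)        # the per-channel projection keeps them apart
```

Under this reading, the shared encoder maps every input onto a single line in embedding space, so no downstream layer can recover which channel carried which value.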
For practitioners developing time series models, this work suggests that architectural complexity around input encoding delivers diminishing returns. The study identifies two narrow cases where added sophistication provides measurable benefits: projected sinusoidal positional encodings excel at small channel counts through positional-channel orthogonalization, while nonlinear MLP stems show advantages at larger channel counts, though those benefits shrink with additional training data. The research supports the principle of starting with proven baselines before introducing complexity, which reduces the hyperparameter search space and improves reproducibility in time series modeling applications.
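To make the complexity trade-off concrete, the following sketch compares the linear baseline with a nonlinear MLP stem. The two-layer shape, the GELU activation, and the hidden width of 64 are assumptions chosen for illustration, not the configuration used in the study.

```python
import torch
import torch.nn as nn

n_channels, d_model, hidden = 7, 32, 64  # illustrative sizes only

# Baseline: one linear map from the channel vector to the model dimension.
linear_encoder = nn.Linear(n_channels, d_model)

# Hypothetical MLP stem: the same input and output shapes with one hidden nonlinearity.
mlp_stem = nn.Sequential(
    nn.Linear(n_channels, hidden),
    nn.GELU(),
    nn.Linear(hidden, d_model),
)


def count_parameters(module: nn.Module) -> int:
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)


x = torch.randn(2, 16, n_channels)  # (batch, time, channels)
linear_out = linear_encoder(x)      # (2, 16, 32)
mlp_out = mlp_stem(x)               # (2, 16, 32)

linear_params = count_parameters(linear_encoder)  # 7 * 32 + 32
mlp_params = count_parameters(mlp_stem)           # (7 * 64 + 64) + (64 * 32 + 32)
```

Both encoders are drop-in replacements for one another, so the extra parameters and the hidden width are the only additional choices the MLP stem introduces in this sketch.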
- Standard per-channel linear projection (nn.Linear) is practically equivalent to the more sophisticated encoder alternatives on both synthetic and real benchmarks.
- Shared-scalar and channel-independent encoders fail decisively: the former for information-theoretic reasons, the latter by overfitting across the synthetic tasks.
- Projected positional encodings achieve the best performance at small channel counts through positional-channel orthogonalization, a mechanism empirically demonstrated via geometric probing.
- Nonlinear MLP stem encoders marginally outperform at the largest tested channel counts, but the improvements diminish significantly with more training data.
- Reproducible code and datasets enable verification of all experimental results and support this empirical methodology in machine learning research.