ITNet: A Learnable Integral Transform That Subsumes Convolution, Attention, and Recurrence
Researchers introduce ITNet, a unified neural network architecture built on learnable integral transforms that mathematically subsumes convolutional networks, transformers, and recurrent networks as special cases. The model demonstrates that these three historically distinct architectural families can emerge from a single underlying mathematical framework, with experiments showing competitive performance across vision, language, and multimodal tasks.
ITNet unifies three dominant architectural paradigms under a single mathematical framework. Rather than treating convolution, attention, and recurrence as fundamentally different approaches, the researchers show that each is a learnable integral transform under a different parameterization of its kernel. In ITNet the kernel is itself a small neural network that learns pairwise interactions between positions and features, so the same operator can adapt to the data rather than having one interaction pattern fixed by the architecture.
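To make the special-case claim concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes a discretized transform y_i = sum_j K(i, j, x) x_j over sequence positions, and the function names (`integral_transform`, `conv_kernel`, `attention_kernel`, `linear_recurrence_kernel`) are hypothetical. Fixing the kernel in three ways yields a local convolution, softmax self-attention without a value projection, and a linear recurrence. Nonlinear RNNs and ITNet's learned neural kernel are outside this toy example.

```python
import numpy as np

def integral_transform(x, kernel):
    """Discretized integral transform: y[i] = sum_j K[i, j] * x[j].

    x has shape (n, d); kernel(x) returns an (n, n) matrix of pairwise weights.
    """
    return kernel(x) @ x

def conv_kernel(weights):
    """Convolution-like kernel: depends only on the offset j - i, zero outside a window."""
    weights = np.asarray(weights, dtype=float)  # odd length, 2 * radius + 1
    radius = len(weights) // 2
    def kernel(x):
        n = x.shape[0]
        offsets = np.arange(n)[None, :] - np.arange(n)[:, None]  # entry [i, j] = j - i
        K = np.zeros((n, n))
        mask = np.abs(offsets) <= radius
        K[mask] = weights[offsets[mask] + radius]
        return K
    return kernel

def attention_kernel(Wq, Wk):
    """Attention-like kernel: depends on content, each row normalized by a softmax."""
    def kernel(x):
        scores = (x @ Wq) @ (x @ Wk).T / np.sqrt(Wq.shape[1])
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        w = np.exp(scores)
        return w / w.sum(axis=1, keepdims=True)
    return kernel

def linear_recurrence_kernel(decay):
    """Recurrence-like kernel: causal, K[i, j] = decay ** (i - j) for j <= i.

    Equivalent to unrolling the linear recurrence h[i] = decay * h[i - 1] + x[i].
    """
    def kernel(x):
        n = x.shape[0]
        lag = np.arange(n)[:, None] - np.arange(n)[None, :]  # entry [i, j] = i - j
        return np.where(lag >= 0, decay ** np.clip(lag, 0, None), 0.0)
    return kernel

# Usage: the same operator, three different kernels.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))
y_conv = integral_transform(x, conv_kernel([0.2, 0.5, 0.3]))
y_attn = integral_transform(x, attention_kernel(rng.normal(size=(3, 4)), rng.normal(size=(3, 4))))
y_rec = integral_transform(x, linear_recurrence_kernel(0.9))
```

In this sketch the only thing that changes between the three families is how the (n, n) weight matrix is produced; a learned kernel network would replace these hand-written rules.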
This work builds on decades of architectural specialization in deep learning. The three families evolved separately: CNNs dominated vision through local inductive biases, transformers revolutionized NLP via content-dependent attention, and RNNs captured sequential dependencies. Yet ITNet's theoretical framework suggests this fragmentation was unnecessary, since a more general operator could subsume all three families.
The practical implications follow from that generality. A unified architecture reduces engineering complexity, as researchers need not choose between specialized models for different modalities. The authors demonstrate this through experiments on ImageNet-1K, GLUE, ModelNet40, VQA v2, and NLVR2, where a single ITNet variant with shared operators and lightweight modality-specific encoders matches or exceeds task-specific baselines. Having fewer separate models to build and maintain matters in resource-constrained environments and could broaden access to multi-task capable systems.
Future work will likely explore whether this unification extends to even larger models and whether the learned kernels reveal insights into optimal signal processing strategies. The framework also opens questions about whether certain parameterizations emerge consistently across domains, potentially revealing universal principles of information processing.
- ITNet proves convolution, self-attention, and recurrence are special cases of a single learnable integral transform architecture.
- A unified model with shared operators matches specialized baselines across vision, language, and multimodal benchmarks.
- The framework suggests decades of architectural fragmentation reflected incomplete mathematical understanding rather than fundamental diversity.
- Practical innovations including tiled kernel fusion and importance-weighted Monte Carlo integration enable scalable computation.
- This unification could simplify neural architecture design and enable more efficient multi-task learning systems.
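The summary does not say how ITNet's importance-weighted Monte Carlo integration is implemented, so the following is only a generic sketch of the idea under stated assumptions: the transform is the discrete sum y_i = sum_j K[i, j] x[j], source positions j are drawn from a proposal distribution `q`, and each sampled term is reweighted by 1 / q[j] so the estimate stays unbiased. The function name `mc_integral_transform` and its parameters are hypothetical.

```python
import numpy as np

def mc_integral_transform(K, x, q, num_samples, rng):
    """Importance-weighted Monte Carlo estimate of K @ x.

    K: (n, n) kernel matrix, x: (n, d) inputs,
    q: (n,) proposal distribution over source positions j,
       with q[j] > 0 wherever column j of K is nonzero.
    The full matrix K is passed here only for simplicity; a scalable version
    would evaluate the kernel at the sampled positions alone.
    """
    idx = rng.choice(len(q), size=num_samples, p=q)  # sample positions j ~ q
    weights = K[:, idx] / q[idx]                     # importance weights K[i, j] / q[j]
    return weights @ x[idx] / num_samples            # average the reweighted terms

# Usage: compare the estimate with the exact sum on a small problem.
rng = np.random.default_rng(0)
n, d = 8, 2
K = rng.random((n, n))
x = rng.random((n, d))
q = np.full(n, 1.0 / n)  # uniform proposal, an assumption made for illustration
estimate = mc_integral_transform(K, x, q, num_samples=200_000, rng=rng)
exact = K @ x
```

The cost of the estimate scales with the number of samples rather than with the number of positions, which is the usual motivation for this kind of estimator. A proposal that concentrates on positions with large kernel weight reduces the variance.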