Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers
Researchers introduce Tensor Memory, a fixed-size recurrent module that augments Transformers with persistent 3D spatial state for improved long-sequence processing. The approach enables better video understanding and occlusion reasoning by decoupling memory capacity from input length while maintaining computational efficiency.
Tensor Memory addresses a fundamental limitation in current Transformer architectures: their struggle with long-horizon tasks requiring persistent spatial reasoning. Standard Transformers flatten spatial and temporal information into a linear token sequence, and the cost of self-attention grows quadratically with the length of that sequence. As a result, tasks like video understanding and occlusion-sensitive reasoning become computationally expensive, and persistent spatial structure becomes harder for the model to track. The proposed solution introduces a learnable 3D voxel grid whose size stays constant regardless of input length, changing how models accumulate and access spatial information.
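To make the scaling contrast above concrete, here is a minimal arithmetic sketch. The function names, the 16x16x16 grid, and the 64 channels are illustrative assumptions, not figures reported for Tensor Memory; the sketch only counts entries and ignores constant factors such as heads and layers.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key score entries in full self-attention."""
    return seq_len * seq_len


def grid_cells(depth: int, height: int, width: int, channels: int) -> int:
    """Number of values stored in a fixed voxel grid (independent of input length)."""
    return depth * height * width * channels


# Hypothetical grid size chosen purely for illustration.
fixed_state = grid_cells(16, 16, 16, 64)

for n in (1_000, 10_000, 100_000):
    # Attention entries grow with n squared; the grid size does not change.
    print(f"tokens={n:>7}  attention_pairs={attention_pairs(n):>14}  grid_cells={fixed_state}")
```

Doubling the sequence length quadruples the attention entries, while the grid term stays the same, which is the sense in which memory capacity is decoupled from input length.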
The mechanism operates through three key operations: tokens write content into the memory grid via soft writes centered on predicted 3D coordinates, local interaction operators update the memory efficiently, and tokens read contextual information back through continuous sampling. This design preserves spatial inductive biases, which are crucial for visual reasoning, while avoiding the problem of KV caching, whose memory footprint grows with sequence length. The approach represents an incremental but meaningful advancement in Transformer design, building on decades of research in recurrent neural networks and spatial reasoning.
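The write, update, and read steps described above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' implementation: the Gaussian write kernel, the fixed 6-neighbour averaging that stands in for a learned local operator, the trilinear read, and all names and shapes (`soft_write`, `local_update`, `trilinear_read`, `sigma`, `mix`) are hypothetical choices made here.

```python
import numpy as np


def soft_write(memory, coords, values, sigma=1.0):
    """Add each token's value to the grid, spread by a Gaussian around its 3D coordinate."""
    D, H, W, _ = memory.shape
    # Integer voxel positions, shape (D, H, W, 3).
    grid = np.stack(
        np.meshgrid(np.arange(D), np.arange(H), np.arange(W), indexing="ij"), axis=-1
    ).astype(float)
    for xyz, v in zip(coords, values):
        sq_dist = np.sum((grid - xyz) ** 2, axis=-1)
        w = np.exp(-sq_dist / (2.0 * sigma**2))
        w /= w.sum()  # each token deposits a total weight of 1
        memory = memory + w[..., None] * v
    return memory


def local_update(memory, mix=0.5):
    """Assumed stand-in for a learned local operator: blend with the 6-neighbour mean."""
    p = np.pad(memory, ((1, 1), (1, 1), (1, 1), (0, 0)), mode="edge")
    neighbours = (
        p[:-2, 1:-1, 1:-1] + p[2:, 1:-1, 1:-1]
        + p[1:-1, :-2, 1:-1] + p[1:-1, 2:, 1:-1]
        + p[1:-1, 1:-1, :-2] + p[1:-1, 1:-1, 2:]
    ) / 6.0
    return (1.0 - mix) * memory + mix * neighbours


def trilinear_read(memory, coords):
    """Sample the grid at continuous coordinates (each grid dimension must be >= 2)."""
    D, H, W, C = memory.shape
    upper = np.array([D - 1, H - 1, W - 1], dtype=float)
    out = []
    for xyz in coords:
        p = np.clip(xyz, 0.0, upper)
        lo = np.minimum(np.floor(p), upper - 1).astype(int)  # keep lo + 1 in bounds
        f = p - lo  # fractional offset inside the cell, in [0, 1]
        acc = np.zeros(C)
        for dz in (0, 1):
            for dy in (0, 1):
                for dx in (0, 1):
                    w = (
                        (f[0] if dz else 1 - f[0])
                        * (f[1] if dy else 1 - f[1])
                        * (f[2] if dx else 1 - f[2])
                    )
                    acc += w * memory[lo[0] + dz, lo[1] + dy, lo[2] + dx]
        out.append(acc)
    return np.stack(out)


rng = np.random.default_rng(0)
memory = np.zeros((4, 4, 4, 8))  # fixed-size state: 4x4x4 voxels, 8 channels
for _ in range(3):  # three chunks of a longer token stream
    coords = rng.uniform(0, 3, size=(5, 3))  # stand-in for predicted 3D coordinates
    values = rng.normal(size=(5, 8))         # stand-in for token content
    memory = soft_write(memory, coords, values)
    memory = local_update(memory)
    context = trilinear_read(memory, coords)
print(memory.shape, context.shape)
```

In this sketch the memory keeps the same shape however many chunks are processed, and every step is built from smooth arithmetic on the coordinates, which is what would let gradients reach the coordinate predictor in an autodiff framework.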
For the broader AI community, Tensor Memory demonstrates that architectural innovations targeting specific bottlenecks remain valuable despite Transformer dominance. The method's compatibility with existing training pipelines and its modular design make adoption straightforward for researchers and practitioners. The evaluation across language, image, and video benchmarks suggests broad applicability rather than narrow optimization.
The research suggests continued evolution in sequence modeling beyond pure attention mechanisms. Hybrid architectures combining Transformers with spatial memory structures may become increasingly relevant for tasks demanding persistent state, including robotics, autonomous systems, and extended video analysis.
- Tensor Memory introduces fixed-size 3D recurrent state to Transformers, decoupling memory capacity from sequence length
- The module enables better long-horizon video understanding and occlusion-sensitive reasoning through persistent spatial representation
- Differentiable soft writes and continuous sampling allow efficient gradient flow while preserving spatial inductive biases
- Architecture integrates seamlessly into existing Transformer blocks without requiring fundamental redesign
- Evaluation spans language, vision, and video tasks, suggesting broad applicability beyond specialized domains