Rethinking Temporal Consistency in Video Object-Centric Learning: From Prediction to Correspondence
Researchers propose Grounded Correspondence, a new framework for video object tracking that replaces learned prediction models with deterministic bipartite matching. By leveraging existing vision backbone features, the approach achieves competitive results without learnable temporal parameters, challenging the conventional approach of using dynamics modules for temporal consistency.