DVGT: Driving Visual Geometry Transformer
Researchers introduce DVGT, a transformer-based model for 3D scene reconstruction in autonomous driving that works without explicit camera parameters. Trained on multiple large driving datasets, the model infers dense geometry directly from unposed multi-view sequences, removing the dependence on precise calibration data, and reports improved performance.
DVGT represents a meaningful advance in autonomous driving perception by decoupling 3D geometry reconstruction from rigid camera calibration requirements. Traditional autonomous vehicle systems rely heavily on precisely calibrated camera parameters to build accurate 3D scene models, which creates engineering bottlenecks and limits flexibility across vehicle configurations. By combining intra-view local attention, cross-view spatial attention, and cross-frame temporal attention in a transformer architecture, DVGT learns to infer geometric relationships directly from image sequences, with no camera parameters supplied as input.
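To make the three attention patterns concrete, here is a minimal NumPy sketch of how attention can be factorized over a multi-view video. It is an illustration under stated assumptions, not DVGT's implementation: the token layout `(T, V, N, D)` (frames, views, patches per image, channels), the order of the three stages, the choice of which tokens are grouped in each stage, and the projection-free `self_attention` helper are all hypothetical.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention with no learned projections (assumption).

    x has shape (..., L, D); attention runs over the L axis.
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def factorized_block(tokens):
    """Apply intra-view, cross-view, then cross-frame attention in sequence.

    tokens: array of shape (T, V, N, D), a hypothetical layout.
    """
    T, V, N, D = tokens.shape

    # Intra-view: each image attends only over its own N patch tokens.
    x = self_attention(tokens)

    # Cross-view: within one frame, tokens from all V views attend to each other.
    x = self_attention(x.reshape(T, V * N, D)).reshape(T, V, N, D)

    # Cross-frame: each (view, patch) position attends across the T frames.
    x = np.transpose(x, (1, 2, 0, 3))      # (V, N, T, D)
    x = self_attention(x)
    return np.transpose(x, (2, 0, 1, 3))   # back to (T, V, N, D)

rng = np.random.default_rng(0)
out = factorized_block(rng.standard_normal((2, 3, 4, 8)))
print(out.shape)
```

Splitting attention this way keeps each attention call over a short sequence (N, V·N, or T tokens) instead of one call over all T·V·N tokens, which is the usual motivation for factorized designs.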
The research addresses a critical gap in autonomous driving technology. Existing dense geometry perception models struggle to adapt across different scenarios and camera setups, forcing manufacturers to maintain separate pipelines for different hardware configurations. DVGT's architecture, built on DINO visual features and multi-head decoding strategies, predicts metric-scaled geometry directly, so its output does not need post-hoc alignment against external sensors such as LiDAR. This reduces system complexity and cost.
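For context on what post-alignment means, the sketch below shows median scaling, a common way to bring scale-ambiguous depth predictions into metric units using sparse reference depths. It illustrates the kind of step that a metric-scaled model avoids; it is not taken from DVGT, and the function name and sample values are hypothetical.

```python
from statistics import median

def median_scale_align(pred_depths, reference_depths):
    """Rescale scale-ambiguous depths so their median matches the reference.

    pred_depths: relative depths from a model (arbitrary units).
    reference_depths: metric depths at the same pixels, e.g. from LiDAR.
    Returns the rescaled depths and the scale factor that was applied.
    """
    scale = median(reference_depths) / median(pred_depths)
    return [d * scale for d in pred_depths], scale

# Hypothetical values: the prediction has the right shape but the wrong scale.
aligned, scale = median_scale_align([1.0, 2.0, 4.0], [5.0, 10.0, 20.0])
print(scale, aligned)
```

A model that outputs metric depth directly can skip this step, which also removes the need for a reference sensor at inference time.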
The comprehensive training approach combining nuScenes, OpenScene, Waymo, KITTI, and DDAD datasets positions the model for real-world robustness across diverse driving conditions and geographic regions. This multi-dataset training methodology helps the transformer generalize better than single-dataset approaches, potentially improving deployment flexibility for autonomous vehicle manufacturers.
For the autonomous driving industry, this research reduces engineering constraints around camera standardization while improving perception reliability. As autonomous systems mature, models that adapt to varied hardware configurations become increasingly valuable for fleet deployment across different vehicle types and manufacturer partnerships.
- DVGT eliminates explicit camera calibration requirements by learning geometric relationships directly from image sequences
- Multi-dataset training on five major autonomous driving datasets enhances model generalization across diverse scenarios
- Metric-scaled geometry prediction removes dependency on external sensor alignment, reducing system complexity
- Transformer architecture with temporal attention enables accurate 3D reconstruction from unposed multi-view inputs
- Camera-agnostic design increases deployment flexibility for autonomous vehicle manufacturers with varying hardware configurations