DiffCrossGait: Trajectory-Level Alignment for 2D-3D Cross-Modal Gait Recognition via Latent Diffusion
DiffCrossGait is a deep learning approach that uses latent diffusion models to improve cross-modal gait recognition between 2D silhouettes and 3D LiDAR data. The method reports state-of-the-art results on the SUSTech1K and FreeGait benchmarks by aligning trajectories during the generative process rather than only at the embedding level, while keeping inference computationally efficient.
DiffCrossGait addresses a fundamental challenge in biometric recognition: matching gait patterns across sensor modalities separated by an inherent domain gap. Traditional cross-modal matching methods treat 2D and 3D representations as directly equivalent, ignoring the distinct characteristics of silhouette data versus LiDAR point clouds. This research reframes the problem by using diffusion models (generative models that learn to gradually denoise data) as alignment mechanisms throughout the learning process rather than matching only the final embeddings.
The technical innovation lies in the Tri-Phase Alignment Strategy, which uses varying noise intensities to enforce three constraints: identity consistency, motion dynamics alignment, and structural recoverability across modalities. By driving both 2D and 3D data through shared noise in latent space, the model learns representations that capture gait patterns independent of sensor type. This approach differs from prior work by operating at the trajectory level (considering entire motion sequences) rather than on individual frames or static features.
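The shared-noise idea can be illustrated with the standard forward diffusion step, z_t = sqrt(ᾱ_t)·z_0 + sqrt(1 − ᾱ_t)·ε, applied to both modality latents with the same ε. The following is a minimal sketch, not the authors' implementation: the linear beta schedule, the timesteps, the toy latent vectors, and the function names are all assumptions. The summary does not say which noise intensity serves which of the three constraints, so the levels are labeled only as low, mid, and high.

```python
import math
import random

def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s < t, using an assumed linear beta schedule."""
    prod = 1.0
    for s in range(t):
        beta = beta_start + (beta_end - beta_start) * s / (num_steps - 1)
        prod *= 1.0 - beta
    return prod

def add_noise(z0, eps, a_bar):
    """Forward diffusion step: z_t = sqrt(a_bar) * z0 + sqrt(1 - a_bar) * eps."""
    return [math.sqrt(a_bar) * z + math.sqrt(1.0 - a_bar) * e for z, e in zip(z0, eps)]

def shared_noise_pair(z_2d, z_3d, t, rng):
    """Noise both modality latents with the SAME epsilon at the same timestep."""
    eps = [rng.gauss(0.0, 1.0) for _ in z_2d]
    a_bar = alpha_bar(t)
    return add_noise(z_2d, eps, a_bar), add_noise(z_3d, eps, a_bar), eps

rng = random.Random(0)
z_sil = [0.5, -1.0, 0.25, 2.0]    # hypothetical latent from a silhouette encoder
z_lidar = [0.4, -0.9, 0.30, 1.8]  # hypothetical latent from a LiDAR encoder

# Three assumed noise intensities, one per alignment phase (mapping not specified here).
for name, t in [("low", 50), ("mid", 500), ("high", 950)]:
    zt_2d, zt_3d, _ = shared_noise_pair(z_sil, z_lidar, t, rng)
    gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(zt_2d, zt_3d)))
    print(f"{name:>4} noise (t={t}): latent gap = {gap:.4f}")
```

Because ε is shared, it cancels in the difference between the two noised latents, so the cross-modal gap is simply scaled by sqrt(ᾱ_t) and shrinks as the noise level rises. In this toy setup that is why different noise intensities expose different amounts of modality-specific detail.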
The practical advantage emerges from architectural decoupling: diffusion serves purely as a training objective, eliminating iterative denoising costs during inference. This design choice makes the system viable for real-world deployment where computational efficiency matters. Strong benchmark performance on SUSTech1K and FreeGait datasets suggests broad applicability across surveillance, forensic analysis, and authentication systems.
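The decoupling described above can be sketched as two separate code paths: a denoising loss used only while training, and a retrieval function that runs one encoder pass per sample followed by a similarity ranking. This is an illustrative sketch under stated assumptions, not the paper's architecture: the linear `encode` function, the noise-prediction MSE loss, the cosine ranking, and the toy data are all hypothetical stand-ins.

```python
import math

def encode(x, weights):
    """Hypothetical stand-in encoder: a single linear map from features to an embedding."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    na = math.sqrt(sum(p * p for p in a))
    nb = math.sqrt(sum(q * q for q in b))
    return dot / (na * nb)

def denoising_loss(eps_pred, eps_true):
    """Training-only objective (assumed): mean squared error between predicted and true noise."""
    return sum((p - t) ** 2 for p, t in zip(eps_pred, eps_true)) / len(eps_true)

def retrieve(probe_2d, gallery_3d, w_2d, w_3d):
    """Inference path: encode once, rank by cosine similarity. No denoising loop is run."""
    q = encode(probe_2d, w_2d)
    scores = {pid: cosine(q, encode(x, w_3d)) for pid, x in gallery_3d.items()}
    return max(scores, key=scores.get), scores

# Toy encoder weights and data (hypothetical values for illustration only).
w_2d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_3d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
gallery = {"subject_a": [1.0, 0.1, 0.0], "subject_b": [0.0, 1.0, 0.3]}

best, scores = retrieve([0.9, 0.2, 0.5], gallery, w_2d, w_3d)
print(best, scores)
```

Note that `retrieve` never calls `denoising_loss` or any sampling routine, so the inference cost in this sketch is independent of the number of diffusion steps used during training.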
The implications may extend beyond gait recognition to other cross-modal biometric and sensor fusion tasks. The framework shows how diffusion models can strengthen multi-modal learning without sacrificing inference speed, a persistent tension in applied AI. Future research may adapt this alignment strategy to other domains where a domain gap prevents modalities from being treated as equivalent.
- DiffCrossGait reformulates 2D-3D gait matching as trajectory-level alignment in diffusion space, achieving state-of-the-art benchmark results
- Tri-Phase Alignment Strategy enforces identity anchoring, dynamics consistency, and structural recoverability across different sensor modalities
- Diffusion operates exclusively during training, preserving inference efficiency and enabling practical deployment
- Method addresses fundamental domain discrepancies between silhouette and LiDAR representations without assuming full modal equivalence
- Framework's decoupled architecture may transfer to other cross-modal biometric and sensor fusion applications