Monocular Biomechanical Tracking of Fingers by Applying Inverse Kinematics to Foundation Models
Researchers developed a method that combines the SAM 3D Body foundation model with inverse kinematics to track finger joint angles from a single monocular video, achieving roughly 10-degree finger joint-angle errors and 6 mm hand-position errors. The approach ports existing AI models to JAX and MuJoCo for GPU-accelerated optimization, enabling clinical monitoring of hand movement and range of motion from standard video without specialized multi-camera setups.
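The per-frame data flow described above can be sketched as below. This is a minimal illustration only: the callables `predict_keypoints_3d`, `map_to_markers`, and `solve_ik` are hypothetical stand-ins (a SAM 3D Body wrapper, the rig-to-marker mapping, and the physics-based IK solver), not the authors' actual API.

```python
from typing import Callable, Sequence

import numpy as np

def track_fingers(
    frames: Sequence[np.ndarray],
    predict_keypoints_3d: Callable[[np.ndarray], np.ndarray],  # e.g. a SAM 3D Body wrapper
    map_to_markers: Callable[[np.ndarray], np.ndarray],        # rig output -> biomechanical markers
    solve_ik: Callable[[np.ndarray], np.ndarray],               # markers -> joint angles (qpos)
) -> np.ndarray:
    """Estimate finger joint angles for each frame of a monocular video."""
    joint_angles = []
    for frame in frames:
        keypoints_3d = predict_keypoints_3d(frame)  # 3D keypoints from the foundation model
        markers = map_to_markers(keypoints_3d)      # map onto the hand model's marker set
        joint_angles.append(solve_ik(markers))      # physics-constrained inverse kinematics
    return np.stack(joint_angles)
```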
This research addresses a significant gap in biomechanical analysis by extending monocular video-based hand tracking to clinical-grade accuracy. Traditional finger tracking requires either expensive multi-camera systems or marker-based motion capture, limiting accessibility in clinical and everyday settings. By leveraging foundation models trained on broad visual data and constraining outputs through biomechanical physics simulations, the researchers achieve performance comparable to gold-standard multi-view systems while requiring only standard video input.
The technical innovation lies in the integration approach: porting SAM 3D Body from PyTorch to JAX enables seamless compatibility with the MuJoCo MJX physics engine for GPU-accelerated optimization. This architectural choice lets the fitted pose respect anatomical constraints (bones cannot bend arbitrarily), improving accuracy beyond what raw computer vision alone provides. A novel mapping between the Momentum Human Rig output space and the biomechanical model's markers supplies the crucial domain-specific engineering.
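One plausible way to realize the IK stage is gradient-based optimization through MJX's differentiable forward kinematics in JAX. The sketch below is an assumption-laden illustration, not the authors' implementation: the model file `hand_model.xml`, the site indices standing in for the rig-to-marker mapping, and the plain gradient-descent loop are all placeholders.

```python
import jax
import jax.numpy as jnp
import mujoco
from mujoco import mjx

# Load a biomechanical hand model; the file name is a placeholder.
mj_model = mujoco.MjModel.from_xml_path("hand_model.xml")
mjx_model = mjx.put_model(mj_model)
base_data = mjx.make_data(mj_model)

# Indices of MuJoCo sites corresponding to the foundation model's predicted
# keypoints (placeholder values standing in for the rig-to-marker mapping).
site_ids = jnp.array([0, 1, 2, 3, 4])

def marker_residual(qpos, target_xyz):
    """Squared distance between model marker sites and predicted 3D keypoints."""
    data = base_data.replace(qpos=qpos)
    data = mjx.kinematics(mjx_model, data)    # differentiable forward kinematics
    model_markers = data.site_xpos[site_ids]  # (n_markers, 3) site positions
    return jnp.sum((model_markers - target_xyz) ** 2)

# One gradient-descent IK step; the authors' optimizer may differ.
@jax.jit
def ik_step(qpos, target_xyz, lr=1e-2):
    loss, grad = jax.value_and_grad(marker_residual)(qpos, target_xyz)
    return qpos - lr * grad, loss

def solve_ik(target_xyz, n_iters=200):
    """Fit joint angles so the model's markers match the predicted keypoints."""
    qpos = jnp.array(mj_model.qpos0)  # start from the model's neutral pose
    for _ in range(n_iters):
        qpos, loss = ik_step(qpos, target_xyz)
    return qpos
```

Because MJX routines are pure JAX functions, the residual can be `jit`-compiled and batched with `vmap` across frames, which is what makes GPU-accelerated optimization of this kind straightforward.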
For healthcare and clinical applications, this development broadens access to quantitative hand movement analysis. Physical therapists, neurologists, and occupational health professionals could assess range of motion, fine motor control, and recovery progress using smartphone or webcam video. The roughly 10-degree finger joint-angle error is clinically acceptable for many applications, though some specialized assessments may require higher precision.
The work demonstrates how foundation models, when combined with domain knowledge (biomechanics) and appropriate computational frameworks (JAX/MuJoCo), can solve problems that neither component handles well on its own. Future iterations may add temporal consistency across video frames and real-time processing, further expanding clinical utility. Validation across multiple participants and camera viewpoints suggests the approach generalizes.
- Monocular video finger tracking now achieves ~10-degree accuracy in joint angles using foundation models combined with physics-based optimization
- Integration of SAM 3D Body with JAX and MuJoCo enables GPU-accelerated biomechanical analysis without expensive multi-camera systems
- The method is validated against a multi-view gold standard on 4,590 frames, demonstrating clinical-grade accuracy from standard video
- The approach expands access to quantitative hand movement assessment for physical therapy, neurology, and occupational health
- Foundation models constrained by biomechanical physics outperform unconstrained computer vision for anatomically plausible pose estimation