AIBullish · arXiv – CS AI · 10h ago · 6/10
🧠
Distilling 3D Spatial Reasoning into a Lightweight Vision-Language Model with CoT
Researchers have developed a knowledge-distillation framework that compresses a 7B-parameter 3D vision-language model into a 2.29B-parameter student, achieving 8.7x faster inference while retaining 54-72% of the teacher's performance. The approach introduces "Hidden CoT": learnable latent tokens that enable spatial reasoning without explicit chain-of-thought training data, making 3D scene understanding feasible on resource-constrained devices.
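The core idea, learnable latent tokens distilled from a larger teacher, can be sketched as follows. This is a minimal illustration assuming a toy transformer student and soft-label KL distillation; all names, dimensions, and the `HiddenCoTStudent` class are hypothetical, not the paper's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiddenCoTStudent(nn.Module):
    """Toy student model: learnable latent "reasoning" tokens are prepended
    to the input sequence, so spatial reasoning can be learned without any
    explicit chain-of-thought text. Dimensions are illustrative only."""
    def __init__(self, vocab=1000, dim=64, n_latent=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # Learnable latent tokens: ordinary parameters, trained end-to-end.
        self.latent = nn.Parameter(torch.randn(n_latent, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, ids):
        x = self.embed(ids)                            # (B, T, D)
        lat = self.latent.expand(x.size(0), -1, -1)    # (B, L, D)
        h = self.encoder(torch.cat([lat, x], dim=1))   # latents attend jointly
        return self.head(h[:, self.latent.size(0):])   # logits for real tokens

def distill_loss(student_logits, teacher_logits, T=2.0):
    # Standard soft-label KL distillation at temperature T.
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)

student = HiddenCoTStudent()
ids = torch.randint(0, 1000, (2, 10))
teacher_logits = torch.randn(2, 10, 1000)  # stand-in for the 7B teacher's outputs
loss = distill_loss(student(ids), teacher_logits)
loss.backward()  # the latent tokens receive gradients like any other parameter
```

Here the latent tokens play the role of an implicit reasoning scratchpad: they are optimized so that, after attending over them, the student's outputs match the teacher's, with no CoT annotations required.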