Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving
Researchers introduce Drive-KD, a knowledge distillation framework that compresses large vision-language models for autonomous driving by decomposing the task into perception, reasoning, and planning components. The method achieves superior performance with 42x less GPU memory and 11.4x higher throughput compared to larger baseline models, advancing the practical deployment of AI in safety-critical driving systems.
Drive-KD addresses a fundamental challenge in deploying advanced AI systems for autonomous driving: the tension between model capability and computational efficiency. Large vision-language models demonstrate strong reasoning abilities but consume prohibitive resources for real-time driving applications where latency and memory constraints are critical. This research leverages knowledge distillation, a well-established technique for transferring learned patterns from larger models to smaller ones, but applies it strategically by decomposing autonomous driving into distinct capability domains.
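For readers unfamiliar with knowledge distillation, the classic logit-based formulation can be sketched in a few lines. This is the generic textbook loss (temperature-softened KL divergence between teacher and student outputs), not necessarily the loss Drive-KD uses; the function names and the default temperature here are illustrative assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits: List[float],
            teacher_logits: List[float],
            temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic distillation loss; the temperature value is an assumption
    chosen only for illustration.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0.0
    )
    return (temperature ** 2) * kl
```

The loss is zero when the student reproduces the teacher's softened distribution and grows as the two diverge, which is what lets a small model learn from a large one's outputs rather than from hard labels alone.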
The framework's innovation lies in its multi-teacher architecture and asymmetric gradient projection mechanism, which prevents conflicting optimization signals when training on multiple capabilities simultaneously. By identifying layer-specific attention patterns as distillation targets, the researchers create more effective knowledge transfer channels tailored to perception, reasoning, and planning tasks. This modular approach reflects the actual cognitive requirements of autonomous systems rather than treating driving as a monolithic prediction problem.
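The summary does not spell out how the asymmetric gradient projection works, so the following is only a minimal sketch of one plausible reading. It assumes a PCGrad-style rule applied in one direction: a designated "anchor" capability keeps its gradient untouched, while each other capability's gradient has its conflicting component (negative dot product with the anchor) removed before the gradients are summed. The names `project_away_conflict` and `combine_asymmetric` are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def project_away_conflict(g_other: Vector, g_anchor: Vector) -> Vector:
    """Remove from g_other the component that opposes g_anchor.

    If the two gradients do not conflict (dot product >= 0), g_other
    is returned unchanged.
    """
    overlap = dot(g_other, g_anchor)
    norm_sq = dot(g_anchor, g_anchor)
    if overlap >= 0.0 or norm_sq == 0.0:
        return list(g_other)
    scale = overlap / norm_sq
    return [x - scale * y for x, y in zip(g_other, g_anchor)]


def combine_asymmetric(g_anchor: Vector, g_others: List[Vector]) -> Vector:
    """Sum gradients; only the non-anchor gradients are ever projected.

    The asymmetry (anchor never modified) is an assumption made for
    this sketch, not a detail confirmed by the source.
    """
    combined = list(g_anchor)
    for g in g_others:
        projected = project_away_conflict(g, g_anchor)
        combined = [c + p for c, p in zip(combined, projected)]
    return combined


# Example: the second gradient partly opposes the anchor along axis 0.
update = combine_asymmetric([1.0, 0.0], [[-1.0, 1.0]])  # [1.0, 1.0]
```

In the example, the opposing component along the first axis is stripped from the second gradient, so the combined update still makes full progress on the anchor capability while keeping the non-conflicting part of the other.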
The reported performance metrics are notable: matching or exceeding a 78-billion-parameter model with only 1.8 billion parameters represents substantial progress toward efficient AI systems. The reported ability to surpass GPT-5.1 on planning tasks suggests the method captures domain-specific knowledge effectively. For autonomous vehicle developers, this points to potential deployment on edge devices with lower compute budgets and faster inference, which could in turn shorten response times in safety-critical situations.
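To put the parameter counts in perspective, a back-of-the-envelope calculation helps. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts weights only; the reported 42x figure is a measured GPU memory reduction, which also depends on activations, caches, and runtime overhead, so the two numbers are not expected to match exactly.

```python
TEACHER_PARAMS = 78e9   # 78B baseline, from the summary
STUDENT_PARAMS = 1.8e9  # 1.8B distilled model, from the summary
BYTES_PER_PARAM = 2     # assumption: fp16/bf16 weights


def weight_memory_gb(num_params: float,
                     bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate memory for model weights only, in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


param_ratio = TEACHER_PARAMS / STUDENT_PARAMS   # about 43.3x fewer parameters
teacher_gb = weight_memory_gb(TEACHER_PARAMS)   # about 156 GB of weights
student_gb = weight_memory_gb(STUDENT_PARAMS)   # about 3.6 GB of weights
```

Under these assumptions the parameter ratio alone is roughly 43x, which is in the same range as the reported memory reduction.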
Future work should focus on validating these results in real-world driving scenarios and on exploring whether similar distillation strategies apply to other safety-critical AI applications. The reported generalization across model families suggests the approach may have broader applicability beyond autonomous driving.
- Drive-KD reduces GPU memory requirements by 42x and increases throughput by 11.4x while maintaining or improving performance on autonomous driving tasks.
- Multi-teacher knowledge distillation with asymmetric gradient projection successfully transfers perception, reasoning, and planning capabilities to smaller models.
- The distilled 1.8B parameter model outperforms a 78B baseline from the same family and exceeds GPT-5.1 on planning benchmarks.
- Decomposing autonomous driving into capability-specific domains enables more effective knowledge transfer than standard fine-tuning approaches.
- The method demonstrates generalization across diverse model families and scales, suggesting broader applicability to safety-critical AI systems.