Wisdom of Committee: Diverse Distillation from Large Foundation Models and Domain Experts
Researchers introduce DiverseDistill, a knowledge distillation framework that leverages multiple teachers (foundation models plus domain experts) to more effectively transfer knowledge to compact models. The method recovers 73-114% of the performance gap between teacher and student models while operating with frozen teachers and zero inference overhead.
DiverseDistill addresses a fundamental challenge in machine learning: compressing large foundation models into smaller, deployable systems without catastrophic performance loss. Traditional single-teacher distillation from a 76M-parameter model to a 2M-parameter student recovers less than 40% of the performance gap, making this a critical bottleneck for edge deployment and cost-effective inference. The innovation lies in treating multiple heterogeneous teachers as a committee rather than averaging their outputs naively.
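The "performance gap recovered" figures can be read with a small piece of arithmetic. The sketch below assumes the usual definition, which this summary does not state: the share of the teacher-minus-baseline-student gap that the distilled student closes. The function names and accuracy values are hypothetical and chosen only to illustrate the calculation. The parameter counts are the ones quoted above.

```python
def gap_recovered(teacher: float, student_baseline: float, student_distilled: float) -> float:
    """Fraction of the teacher-student gap closed by distillation (assumed definition)."""
    gap = teacher - student_baseline
    if gap == 0:
        raise ValueError("teacher and baseline student score the same; the gap is undefined")
    return (student_distilled - student_baseline) / gap


def compression_ratio(teacher_params: int, student_params: int) -> float:
    """How many times smaller the student is than the teacher."""
    return teacher_params / student_params


# Hypothetical accuracies, not results from the paper.
print(gap_recovered(teacher=0.80, student_baseline=0.70, student_distilled=0.773))  # about 0.73
print(gap_recovered(teacher=0.80, student_baseline=0.70, student_distilled=0.814))  # above 1.0

# 76M-parameter teacher and 2M-parameter student, as quoted in the text.
print(compression_ratio(76_000_000, 2_000_000))
```

Under this definition, a value above 100% means the distilled student scores higher than the teacher it was distilled from, which is how the upper end of the 73-114% range should be read.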
The framework's technical elegance stems from its practical constraints: it requires no parameter updates to teachers, no co-training, and no architectural modifications. The learnable Question-Answer mechanism dynamically aligns outputs from diverse teachers into the student's representation space, effectively translating between incompatible architectures and modalities. This contrasts sharply with existing approaches that require gradient-based updates to the teachers or model surgery.
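One way to picture aligning heterogeneous teachers into a shared student space is attention over projected teacher outputs. The sketch below is a simplified illustration under stated assumptions, not the paper's Question-Answer mechanism: each frozen teacher's output is mapped into the student's dimension by a projection matrix (standing in for learnable parameters), the student's "question" vector scores each projected "answer" by dot product, and a softmax over those scores weights the committee. All names, shapes, and values are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def align_teachers(question, teacher_outputs, projections):
    """Combine teacher outputs of different sizes into one student-space target.

    question:        student-side query vector (length d_student)
    teacher_outputs: one output vector per frozen teacher, sizes may differ
    projections:     one (d_student x d_teacher) matrix per teacher; these stand in
                     for parameters that would be learned during distillation
    """
    answers = [matvec(W, t) for W, t in zip(projections, teacher_outputs)]
    scores = [sum(q * a for q, a in zip(question, ans)) for ans in answers]
    weights = softmax(scores)
    target = [
        sum(w * ans[k] for w, ans in zip(weights, answers))
        for k in range(len(question))
    ]
    return target, weights


# Two teachers with incompatible output sizes (3 and 2); the student space has size 2.
question = [1.0, 0.0]
teacher_outputs = [[1.0, 0.0, 0.0], [0.0, 1.0]]
projections = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # maps the 3-dim teacher into 2 dims
    [[1.0, 0.0], [0.0, 1.0]],            # identity for the 2-dim teacher
]
target, weights = align_teachers(question, teacher_outputs, projections)
print(target, weights)
```

In this toy setup the first teacher's projected answer matches the question more closely, so it receives the larger weight. The teachers themselves are only read from, never updated, which matches the frozen-teacher constraint described above.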
For the AI and machine learning industry, this work has significant implications for deployment efficiency. The 38x compression ratio in recommendation systems and the 3.6x ratio in vision tasks demonstrate applicability across domains. The dynamic teacher importance mechanism cuts teacher forward passes by ~30% during training without quality degradation, easing the computational cost of querying multiple teachers. Organizations can maintain accuracy standards while reducing inference costs and latency, both critical for real-time applications and resource-constrained environments.
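The saving from a teacher importance mechanism comes from not querying every teacher at every training step. This summary does not describe the selection rule, so the sketch below assumes a simple one for illustration: skip any teacher whose current importance weight falls below a threshold, while always keeping the single most important teacher. The function names, threshold, and weights are hypothetical and are not results from the paper.

```python
def select_teachers(importance, threshold):
    """Return indices of teachers to query this step (assumed threshold rule).

    importance: non-empty list of per-teacher importance weights
    """
    selected = [i for i, w in enumerate(importance) if w >= threshold]
    if not selected:
        # Never skip everyone: fall back to the single most important teacher.
        selected = [max(range(len(importance)), key=importance.__getitem__)]
    return selected


def count_forward_passes(importance_per_step, threshold):
    """Total teacher forward passes across all training steps."""
    return sum(len(select_teachers(imp, threshold)) for imp in importance_per_step)


# Made-up importance weights for 3 teachers over 4 training steps.
steps = [
    [0.60, 0.30, 0.10],
    [0.50, 0.45, 0.05],
    [0.40, 0.35, 0.25],
    [0.90, 0.05, 0.05],
]
used = count_forward_passes(steps, threshold=0.2)
full = len(steps) * len(steps[0])
print(f"{used} of {full} teacher forward passes")
```

Because teachers are frozen and used only through forward passes, each skipped query is a direct reduction in training compute. The size of the saving in practice depends on how the importance weights evolve, which this toy example does not model.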
The zero-inference-overhead design is particularly valuable for production systems: the distillation module is discarded after training, so the deployed student carries no additional architectural complexity. Future work may explore other domains and investigate the optimal composition of the teacher committee for different tasks.
- DiverseDistill recovers 73-114% of the teacher-student performance gap using multiple heterogeneous teachers versus <40% with single-teacher distillation.
- The framework operates entirely with frozen teachers using only forward-pass inference, requiring no parameter updates or architectural modifications.
- Dynamic teacher importance mechanism reduces computational overhead by ~30% during training while maintaining output quality.
- Achieves 38x compression in recommendation tasks and 3.6x in vision tasks with practical deployment advantages.
- Zero inference overhead design eliminates architectural complexity in production systems by discarding the distillation module after training.