The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection
Researchers demonstrate that the highest-performing teacher model doesn't necessarily provide the best training data for student models. They propose Student-Centric Answer Sampling (SCAS), a framework that selects answers based on their estimated learning value for specific students rather than teacher strength alone, showing consistent performance improvements across 30 teacher models and 8 tasks.
This research addresses a fundamental assumption in large language model training that has gone largely unexamined: that the strongest teacher produces the best supervision. The study finds that answer quality is not monolithic: what works for one student may be suboptimal for another, even when multiple teachers provide correct solutions to the same problem.
The efficiency of LLM training increasingly depends on synthetic data generation and knowledge distillation from larger models. As organizations deploy multiple teacher models of varying capabilities, the selection of training data becomes critical. The paper's finding that teacher performance doesn't correlate directly with teaching effectiveness has significant implications for how organizations allocate computational resources during model development.
SCAS uses a token-wise gradient decomposition to estimate learning cost without expensive backpropagation. Because this proxy needs only forward passes, the approach is computationally feasible for large-scale training. The improvements hold across 30 teacher models, 6 student architectures, and 8 distinct tasks, which suggests the framework captures something fundamental about the learning process rather than task-specific artifacts.
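To make the idea of a forward-only cost proxy concrete, here is a minimal sketch. It is not the paper's estimator, whose details this summary does not give. It relies on one standard identity: for softmax cross-entropy, the gradient with respect to the logits at a token is softmax(logits) minus the one-hot target, so its squared norm can be computed from a forward pass alone. The function names (`token_cost`, `answer_cost`, `select_answer`), the averaging over tokens, and the choice to prefer the lowest-cost answer are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_cost(logits, target):
    """Squared L2 norm of d(cross-entropy)/d(logits) for one token.

    The gradient is softmax(logits) - onehot(target), so its squared norm
    is sum(p_i^2) - 2 * p_target + 1. No backward pass is needed.
    """
    p = softmax(logits)
    return sum(q * q for q in p) - 2.0 * p[target] + 1.0


def answer_cost(per_token_logits, target_ids):
    """Mean token cost over an answer (averaging is an assumption)."""
    costs = [token_cost(l, t) for l, t in zip(per_token_logits, target_ids)]
    return sum(costs) / len(costs)


def select_answer(candidates):
    """Return (index of lowest-cost candidate, list of all costs).

    candidates: list of (per_token_logits, target_ids) pairs, one per
    teacher answer, where the logits come from the STUDENT's forward pass.
    Preferring the lowest cost is an illustrative assumption.
    """
    costs = [answer_cost(l, t) for l, t in candidates]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs


# Toy example: vocabulary of 3 tokens, two candidate answers of 2 tokens each.
answer_a = ([[4.0, 0.0, 0.0], [0.0, 4.0, 0.0]], [0, 1])  # student already expects these tokens
answer_b = ([[0.0, 0.0, 4.0], [0.0, 4.0, 0.0]], [0, 1])  # first token surprises the student
best, costs = select_answer([answer_a, answer_b])
```

In this toy setup the first answer receives the lower cost because the student already assigns high probability to its tokens. A real pipeline would obtain the logits by scoring each teacher's answer with the student model, and might weight or aggregate tokens differently.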
For the AI development community, these findings suggest that distillation strategies should become more sophisticated and student-aware. Rather than consolidating around single best-performing teachers, training pipelines might benefit from maintaining diverse teacher ensembles and matching answers to learner needs. This could shift how AI labs structure their training infrastructure, potentially reducing computational waste while improving student model quality.
- The strongest teacher does not necessarily provide the best supervision for a student model, even when its answers are correct
- The Student-Centric Answer Sampling framework selects answers based on estimated learning cost rather than teacher performance alone
- A forward-only gradient proxy enables efficient, scalable answer selection without expensive backpropagation
- Improvements were demonstrated consistently across 30 teacher models, 6 student architectures, and 8 different tasks
- Effective distillation requires matching supervision to individual student needs, not just maximizing teacher strength