Extracurricular Learning: Knowledge Transfer Beyond Empirical Distribution
Authors: Hadi Pouransari, Mojan Javaheripi, Vinay Sharma, Oncel Tuzel
Knowledge distillation has been used to transfer knowledge learned by a sophisticated model (teacher) to a simpler model (student). This technique is widely used to compress model complexity. However, in most applications the compressed student model suffers from an accuracy gap with its teacher. We propose extracurricular learning, a novel knowledge distillation method that bridges this gap by (1) modeling student and teacher output distributions; (2) sampling examples from an approximation to the underlying data distribution; and (3) matching student and teacher output distributions over this extended set, including uncertain samples. We conduct rigorous evaluations on regression and classification tasks and show that, compared to standard knowledge distillation, extracurricular learning reduces the gap by 46% to 68%. This leads to major accuracy improvements over empirical risk minimization-based training for various recent neural network architectures: a 16% regression error reduction on the MPIIGaze dataset, a +3.4% to +9.1% improvement in top-1 classification accuracy on the CIFAR100 dataset, and a +2.9% top-1 improvement on the ImageNet dataset.
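The abstract outlines a three-step recipe: model the output distributions, sample an extended set from an approximate data distribution, and match student to teacher on that set. The sketch below is a minimal illustration of how those steps could fit together for classification; the temperature-softened softmax, the mixup-style interpolation used to synthesize extra samples, and the KL-divergence matching loss are all assumptions made for illustration, not the paper's specified choices.

```python
# Hedged sketch of the three steps described in the abstract (assumptions noted).
import torch
import torch.nn.functional as F

def extended_distillation_step(student, teacher, x, temperature=4.0, alpha=0.4):
    """One step matching student and teacher output distributions on the
    original batch plus synthesized 'extracurricular' samples."""
    # (2) Sample examples approximating the underlying data distribution.
    #     Assumption: convex combinations of shuffled batch elements.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x_extra = lam * x + (1.0 - lam) * x[torch.randperm(x.size(0))]
    x_all = torch.cat([x, x_extra], dim=0)

    # (1) Model teacher and student output distributions.
    #     Assumption: temperature-softened softmax over class logits.
    with torch.no_grad():
        t_logits = teacher(x_all)
    s_logits = student(x_all)
    t_prob = F.softmax(t_logits / temperature, dim=1)
    s_logprob = F.log_softmax(s_logits / temperature, dim=1)

    # (3) Match the two distributions over the extended set.
    #     Assumption: KL divergence, scaled as in standard distillation.
    loss = F.kl_div(s_logprob, t_prob, reduction="batchmean") * temperature ** 2
    return loss
```

In this sketch the student would be updated by backpropagating `loss` with an ordinary optimizer, possibly combined with a supervised loss on the labeled portion of the batch.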