TRANSCRIPT
Knowledge Distillation by On-the-Fly Native Ensemble
Xu Lan1, Xiatian Zhu2, Shaogang Gong1
1Queen Mary University of London, London, UK; 2Vision Semantics
[email protected] [email protected] [email protected]
1. Introduction
Solution: Knowledge Distillation
2. Methodology
Cross Entropy: Hard vs. Soft Class Labels
Ø ONE leads to higher correlations due to the learning constraint from the distillation loss;
Ø ONE yields superior mean model generalisation capability, with a lower error rate of 26.61 vs. 31.07 by ME.
3. Experiments
4. Further Analysis
Ø CIFAR and SVHN tests
Figure 1: Knowledge distillation and vanilla training
Figure 2: Overview of online distillation training of ResNet-110 by the proposed On-the-Fly Native Ensemble (ONE). With ONE, we reconfigure the network by adding m auxiliary branches. Each branch, together with the shared layers, makes an individual model, and their ensemble is used to build the teacher model.
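As a rough sketch of this reconfiguration (pure Python, with made-up layer sizes and random weights standing in for the trained ResNet-110 trunk, branches, and gate; the function and variable names are ours, not the authors'):

```python
import math
import random

random.seed(0)

NUM_CLASSES = 3    # C, number of classes (made-up size)
NUM_BRANCHES = 3   # m + 1 branches: Branch 0 .. Branch m
FEAT_DIM = 4       # feature dimension of the shared trunk (made-up size)

def rand_matrix(rows, cols):
    """Small random weight matrix (stand-in for trained parameters)."""
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def linear(w, x):
    """Matrix-vector product: one fully connected layer."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Shared low-level layers (Conv1, Res2X, Res3X in the paper),
# mimicked here by a single shared linear map.
shared_w = rand_matrix(FEAT_DIM, FEAT_DIM)

# Each branch has its own high-level layers + classifier (here: one linear map).
branch_ws = [rand_matrix(NUM_CLASSES, FEAT_DIM) for _ in range(NUM_BRANCHES)]

# The gate scores the shared feature and softmax-normalises the scores
# into per-branch weights g_i.
gate_w = rand_matrix(NUM_BRANCHES, FEAT_DIM)

def forward(x):
    feat = linear(shared_w, x)                        # shared trunk
    branch_logits = [linear(w, feat) for w in branch_ws]
    g = softmax(linear(gate_w, feat))                 # gate weights, sum to 1
    # Ensemble (teacher) logits: weighted sum of the branch logits.
    z_e = [sum(g[i] * branch_logits[i][j] for i in range(NUM_BRANCHES))
           for j in range(NUM_CLASSES)]
    return branch_logits, g, z_e

branch_logits, g, z_e = forward([0.5, -0.2, 0.1, 0.9])
```

At test time, any single branch (plus the shared layers) can be kept as a compact deployed model, while the gated ensemble exists only during training.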
Ø ImageNet test
Ø Knowledge Distillation and Ensemble Comparisons
(1) Model variance: average prediction differences between every two models/branches. (2) Mean model generalisation capability.
5. References
[1] G. Hinton, O. Vinyals, and J. Dean. “Distilling the knowledge in a neural network.” arXiv:1503.02531, 2015.
[2] Y. Zhang, T. Xiang, T. Hospedales, and H. Lu. “Deep mutual learning.” CVPR, 2018.
Multi-Branch Design:
Ø Ignores correlations between classes
Ø Prone to model overfitting
[Figure 1 diagram: (a) Vanilla strategy: training samples (TS) feed a target model whose predictions (P) are matched against labels (L); (b) Knowledge distillation: a separately trained teacher model's predictions additionally supervise the target model.]
Limitations of Offline Knowledge Distillation:
Ø Lengthy training time
Ø Complex multi-stage training process
Ø Possible teacher model overfitting
Limitations of Online Knowledge Distillation:
Ø Provides limited extra supervision information
Ø Still needs to train multiple models
Ø Complex asynchronous model updating
[Figure 2 diagram: the input passes through low-level layers (Conv1, Res2X, Res3X) shared by all branches, then through per-branch high-level layers (Res4X block) and classifiers for Branch 0 … Branch m, each producing its own logits and predictions; a gate combines the branch logits into ensemble logits and ensemble predictions, which drive the on-the-fly knowledge distillation back to each branch.]
Gate Network:
On-the-Fly Knowledge Distillation:
A gate which learns to ensemble all (m + 1) branches to build a stronger teacher:
m auxiliary branches with the same configuration, each serving as an independent efficient classification model.
Compute soft probability distributions at a temperature of T for the branches and the ONE teacher as:
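The temperature-softened softmax can be sketched in plain Python (illustrative only; the logits and the temperature value are made up):

```python
import math

def soft_targets(logits, T=3.0):
    """Softmax over logits / T. A larger T spreads probability mass
    across classes, exposing inter-class correlations."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 1.5, 0.5]              # hypothetical branch logits, 3 classes
hard = soft_targets(logits, T=1e-6)   # in the limit T -> 0: ~one-hot
soft = soft_targets(logits, T=3.0)    # softened targets used for distillation
```

Compared to the near-one-hot `hard`, the softened `soft` assigns visible probability to the non-maximal classes, which is exactly the extra inter-class information a hard label discards.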
Distill knowledge from the teacher to each branch:
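A minimal sketch of that teacher-to-branch distillation loss in plain Python (the logits are hypothetical; the KL form follows the L_kl definition above, summed over branches):

```python
import math

def softmax_T(logits, T):
    """Softmax at temperature T."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, branch_logits_list, T=3.0):
    """KL(teacher soft targets || branch soft targets), summed over branches:
    sum_i sum_j p_e^j * log(p_e^j / p_i^j)."""
    p_e = softmax_T(teacher_logits, T)
    loss = 0.0
    for branch_logits in branch_logits_list:
        p_i = softmax_T(branch_logits, T)
        loss += sum(pe * math.log(pe / pi) for pe, pi in zip(p_e, p_i))
    return loss

# Hypothetical logits: ensemble teacher vs. two branches
teacher = [3.0, 1.0, 0.2]
branches = [[2.5, 1.2, 0.3], [3.2, 0.8, 0.1]]
loss = distill_loss(teacher, branches, T=3.0)
```

Since KL divergence is non-negative and zero only when the distributions match, this loss pulls every branch toward the (stronger) ensemble teacher.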
Ensemble (teacher) logits from the gate:
$z_e = \sum_{i=0}^{m} g_i \cdot z_i$
Soft probability distribution at temperature $T$ (class $j$, branch $i$; likewise for the ensemble $e$):
$\tilde{p}_i^{\,j} = \dfrac{\exp(z_i^j / T)}{\sum_{l=1}^{C} \exp(z_i^l / T)}$
Distillation loss from the ONE teacher to the branches:
$\mathcal{L}_{kl} = \sum_{i=0}^{m} \sum_{j=1}^{C} \tilde{p}_e^{\,j} \log \dfrac{\tilde{p}_e^{\,j}}{\tilde{p}_i^{\,j}}$
Overall loss:
$\mathcal{L} = \sum_{i=0}^{m} \mathcal{L}_{ce}^{i} + \mathcal{L}_{ce}^{e} + T^2 \cdot \mathcal{L}_{kl}$
Drawbacks of Hard-Label Cross Entropy:
Overall Loss Function:
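Putting the pieces together, a sketch of the overall training loss (pure Python; the logits, temperature, and function names are illustrative assumptions, not the authors' implementation). The T^2 factor rescales the distillation gradients, as in Hinton et al. [1]:

```python
import math

def softmax_T(logits, T=1.0):
    """Softmax at temperature T."""
    m = max(z / T for z in logits)
    exps = [math.exp(z / T - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    """Standard hard-label cross entropy at T = 1."""
    return -math.log(softmax_T(logits, 1.0)[label])

def kl(p, q):
    """KL divergence between two probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def one_overall_loss(branch_logits_list, ensemble_logits, label, T=3.0):
    """L = sum_i L_ce^i + L_ce^e + T^2 * sum_i KL(p_e || p_i)."""
    ce_branches = sum(cross_entropy(z, label) for z in branch_logits_list)
    ce_ensemble = cross_entropy(ensemble_logits, label)
    p_e = softmax_T(ensemble_logits, T)
    l_kl = sum(kl(p_e, softmax_T(z, T)) for z in branch_logits_list)
    return ce_branches + ce_ensemble + T * T * l_kl

# Hypothetical values: 3 classes, true class 0, two branches + their ensemble
branches = [[2.5, 1.2, 0.3], [3.2, 0.8, 0.1]]
ensemble = [2.9, 1.0, 0.2]
total = one_overall_loss(branches, ensemble, label=0, T=3.0)
```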
ONE vs. Model Ensemble (ME)
Ø Effect of On-the-Fly Knowledge Distillation