
Knowledge Distillation by On-the-Fly Native Ensemble
Xu Lan 1, Xiatian Zhu 2, Shaogang Gong 1

1 Queen Mary University of London, London, UK   2 Vision Semantics Ltd
[email protected] [email protected] [email protected]

1. Introduction

Solution: Knowledge Distillation

2. Methodology

Cross Entropy Hard vs. Soft Class Labels:

Ø ONE leads to higher correlations between branch predictions, due to the learning constraint from the distillation loss (measured as in the sketch below);

Ø ONE yields superior mean model generalisation capability, with a lower error rate of 26.61 vs. 31.07 by ME.
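Model variance is reported as the average prediction difference between every two branches/models. A minimal PyTorch-style sketch of one way to compute it; using the mean absolute difference of softmax outputs is an assumption about the exact distance, not the paper's stated choice:

```python
import torch
import torch.nn.functional as F

def branch_variance(logits_per_branch):
    """Average pairwise prediction difference between branches.

    logits_per_branch: list of tensors, each (num_samples, num_classes), one per branch.
    Returns the mean, over all branch pairs, of the mean absolute difference
    between their softmax prediction distributions.
    """
    probs = [F.softmax(z, dim=1) for z in logits_per_branch]
    diffs = []
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            diffs.append((probs[i] - probs[j]).abs().mean())
    return torch.stack(diffs).mean()

# Example: three branches, 8 samples, 10 classes
branches = [torch.randn(8, 10) for _ in range(3)]
print(branch_variance(branches))
```

Lower variance corresponds to higher correlation between the branch predictions.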

3. Experiments

4. Further Analysis

Ø CIFAR and SVHN tests

Figure 1: Knowledge distillation and vanilla training

Figure 2: Overview of online distillation training of ResNet-110 by the proposed On-the-Fly Native Ensemble (ONE). With ONE, we reconfigure the network by adding m auxiliary branches. Each branch with shared layers makes an individual model, and their ensemble is used to build the teacher model.


Ø ImageNet test

Ø Knowledge Distillation and Ensemble Comparisons

Metrics for comparing ONE and ME: (1) Model variance: the average prediction difference between every two models/branches. (2) Mean model generalisation capability.

5. References

[1] G. Hinton, O. Vinyals, and J. Dean. "Distilling the knowledge in a neural network." arXiv:1503.02531, 2015.
[2] Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu. "Deep mutual learning." CVPR, 2018.

Drawbacks of Hard Label based Cross Entropy:

Ø Considers no correlation between classes

Ø Prone to model overfitting
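A small illustration of the hard vs. soft label contrast (the class set, logits, and temperature are made-up values): a one-hot target carries no inter-class similarity, whereas a temperature-softened teacher distribution does.

```python
import torch
import torch.nn.functional as F

# Hypothetical 4-class example: [cat, dog, truck, car]
hard_label = torch.tensor([1.0, 0.0, 0.0, 0.0])      # one-hot: all wrong classes treated alike

teacher_logits = torch.tensor([6.0, 4.5, 0.5, 0.2])  # "cat" scores highest, "dog" is close
T = 3.0                                               # distillation temperature
soft_label = F.softmax(teacher_logits / T, dim=0)     # softened target keeps cat~dog similarity

print(hard_label)   # tensor([1., 0., 0., 0.])
print(soft_label)   # roughly [0.52, 0.32, 0.08, 0.08]: inter-class structure is retained
```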

Figure 1 diagram: (a) vanilla training, where labels supervise the target model's predictions on the training samples; (b) knowledge distillation, where a separately trained teacher model's predictions additionally supervise the target model. Legend: P = model predictions, L = labels, TS = training samples.

Limitations of Offline Knowledge Distillation:

Ø Lengthy training time

Ø Complex multi-stage training process

Ø Possible teacher model overfitting

Limitations of Online Knowledge Distillation:

Ø Provides limited extra supervision information

Ø Still needs to train multiple models

Ø Complex asynchronous model updating

Figure 2 diagram: shared low-level layers (Conv1, Res2X, Res3X) feed m + 1 high-level branches (Res4X block + classifier, Branch 0 … Branch m); a gate combines the branch logits into ensemble logits and predictions, and on-the-fly knowledge distillation transfers the ensemble's knowledge back to each branch.
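A minimal PyTorch-style sketch of this branch-plus-gate layout (the module names, the shared/make_branch constructor arguments, and the gate input dimension handling are illustrative assumptions, not the authors' code):

```python
import torch
import torch.nn as nn

class ONENet(nn.Module):
    """Shared low-level layers + (m + 1) branches + a gate that builds the teacher ensemble."""

    def __init__(self, shared, make_branch, gate_in_dim, m=2):
        super().__init__()
        self.shared = shared                                    # e.g. Conv1 + Res2X + Res3X
        self.branches = nn.ModuleList(
            [make_branch() for _ in range(m + 1)]               # each: Res4X block + classifier
        )
        self.gate = nn.Linear(gate_in_dim, m + 1)               # branch importance scores

    def forward(self, x):
        h = self.shared(x)                                      # shared low-level feature map
        logits = torch.stack([b(h) for b in self.branches], 1)  # (B, m+1, C) branch logits
        g = torch.softmax(self.gate(h.flatten(1)), dim=1)       # (B, m+1) gate weights
        teacher_logits = (g.unsqueeze(-1) * logits).sum(dim=1)  # (B, C) ensemble (teacher) logits
        return logits, teacher_logits
```

Each branch together with the shared layers forms an individual model, so a single branch can be kept for deployment, while the gated ensemble serves only as the on-the-fly teacher during training.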

Multi-Branch Design: m auxiliary branches with the same configuration, each serving as an independent, efficient classification model.

Gate Network: a gate which learns to ensemble all (m + 1) branches to build a stronger teacher.

On-the-Fly Knowledge Distillation: compute soft probability distributions at a temperature T for the branches and the ONE teacher, then distill knowledge from the teacher to each branch:

$$z_e \;=\; \sum_{i=0}^{m} g_i \, z_i, \qquad \tilde{p}_i(c \mid T) \;=\; \frac{\exp\!\left(z_i^{c}/T\right)}{\sum_{j=1}^{C} \exp\!\left(z_i^{j}/T\right)}, \qquad \mathcal{L}_{kl} \;=\; \sum_{i=0}^{m} \sum_{c=1}^{C} \tilde{p}_e(c \mid T)\,\log\frac{\tilde{p}_e(c \mid T)}{\tilde{p}_i(c \mid T)}$$

where $z_i$ denotes the logits of branch $i$ ($i = 0, \dots, m$), $g_i$ the gate importance scores, $z_e$ the ensemble (teacher) logits, $C$ the number of classes, and $T$ the distillation temperature.
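A minimal PyTorch-style sketch of these terms (a sketch under assumptions about tensor shapes and reductions, not the authors' released code):

```python
import torch.nn.functional as F

def one_losses(branch_logits, teacher_logits, targets, T=3.0):
    """branch_logits: (B, m+1, C); teacher_logits: (B, C) gated ensemble z_e; targets: (B,) int labels."""
    n_branches = branch_logits.shape[1]

    # Hard-label cross entropy for every branch and for the ensemble teacher
    ce_branches = sum(F.cross_entropy(branch_logits[:, i], targets)
                      for i in range(n_branches))
    ce_teacher = F.cross_entropy(teacher_logits, targets)

    # Temperature-softened teacher distribution, distilled into every branch via KL divergence
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    kl = sum(F.kl_div(F.log_softmax(branch_logits[:, i] / T, dim=1),
                      soft_teacher, reduction="batchmean")
             for i in range(n_branches))

    return ce_branches, ce_teacher, kl
```

In the overall objective (below), the distillation term is typically scaled by T² so that its gradient magnitude stays comparable to the cross-entropy terms, following Hinton et al. [1].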

Overall Loss Function:
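Combining the terms above (a reconstruction consistent with the descriptions on this poster and the usual T² scaling convention [1]; not copied verbatim from the poster):

$$\mathcal{L} \;=\; \sum_{i=0}^{m} \mathcal{L}_{ce}^{\,i} \;+\; \mathcal{L}_{ce}^{\,e} \;+\; T^{2}\,\mathcal{L}_{kl}$$

where $\mathcal{L}_{ce}^{\,i}$ is the cross-entropy loss of branch $i$, $\mathcal{L}_{ce}^{\,e}$ that of the gated ensemble teacher, and $\mathcal{L}_{kl}$ the distillation term defined above.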

ONE vs. Model Ensemble (ME)

Ø Effect of On-the-Fly Knowledge Distillation