TRANSCRIPT
Stratos Idreos
Talks at the lab this week:
11:30am today, Yifan Wu from Berkeley: Real-Time Interactive Data Analytic Interfaces
11:30am tomorrow, K. Karanasos from Microsoft: (big data/ML)
Discussion papers as of Monday
Review submission every class
Deep learning:
Academic interest
Market share
Wealth of applications
Some input → Some output
Image recognition: image → Cat, Dog, Table
Machine translation: Γατα → Cat
Auto-complete: What happens in Vegas… → …
cat = f ( )
How do we do this mapping?
Perceptron: data → clean/transform → weighted inputs (e.g., a weight of 3.4) → activation function
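As a minimal sketch (not from the slides), a single perceptron in NumPy: the cleaned/transformed input is combined through learned weights and passed through an activation function. The concrete numbers, including the 3.4 weight, are only illustrative.

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron: weighted sum of the (cleaned/transformed) inputs
    followed by an activation function (here a simple step)."""
    z = np.dot(w, x) + b          # weighted sum of inputs
    return 1.0 if z > 0 else 0.0  # step activation

# Hypothetical example: 3 input features, one weight set to 3.4 as on the slide
x = np.array([0.2, -1.0, 0.5])
w = np.array([3.4, 0.7, -0.1])
print(perceptron(x, w, b=0.1))
```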
Multi-layer perceptron: “can represent a wide variety of interesting functions”
Some weird neural network!
Data to features → Features to labels (e.g., + / -)
Deep neural networks: multiple layers (image → Cat)
Training phase: pass labeled data until we get to an acceptable accuracy
Inference: new data → result
How do we train these networks?
Labeled data
Forward pass to compute a prediction (e.g., “Fish”)
Backward pass to ‘slightly’ nudge the weights
Repeat until happy/convergence
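A schematic training loop, assuming a PyTorch model and data loader; the names and hyperparameters below are placeholders, not the exact setup from the talk.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=10, lr=0.01):
    """Forward pass on labeled data, backward pass to 'slightly' nudge the
    weights, repeated over the data until accuracy is acceptable."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):              # repeat until happy/convergence
        for x, y in loader:              # labeled data
            loss = loss_fn(model(x), y)  # forward pass: prediction + loss
            opt.zero_grad()
            loss.backward()              # backward pass: compute gradients
            opt.step()                   # slightly nudge the weights
    return model
```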
PERFORMANCE (TRAINING/INFERENCE)
READING/WRITING DATA + COMPUTATION
ACCURACY
Layers in state-of-the-art NNs: 7 layers (2012), 19 layers (2014), 152 layers (2015), 264 layers (2017)
Deep Residual Learning for Image Recognition [He et al., 2016]
Single model: one DNN → Cat
Ensemble model: multiple DNNs, each with its own prediction (Cat, Cat, Catastrophe, CAT), combined into one answer
Representationally richer
Wisdom of the crowd
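One simple way to combine the ensemble's predictions is a majority vote, sketched below; the predict interface and the label normalization are assumptions for illustration, and averaging softmax outputs is an equally common combiner.

```python
from collections import Counter

def ensemble_predict(models, x):
    """Each member predicts independently; the ensemble returns the majority vote."""
    votes = [m.predict(x) for m in models]   # e.g. ["Cat", "Cat", "Catastrophe", "CAT"]
    votes = [v.lower() for v in votes]       # normalize label spelling
    label, _ = Counter(votes).most_common(1)[0]
    return label
```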
finding the first principles of neural networks
Ensemble models are VERY EXPENSIVE TO TRAIN
Setup: 100 variants of VGG-16 (different structures), CIFAR-10
Training approaches: Full-data, Bagging
How expensive is it?
[Figure: Training time (hrs., 0 to 60) vs. number of neural networks (1 to 100), Full-data vs. Bagging]
Time to add one neural network:
                                      Full-Data   Bagging
CIFAR-10                              35 min.     27 min.
CIFAR-100 (complex data)              67 min.     35 min.
CIFAR-10 | ResNet (complex model)     516 min.    206 min.
Typical ensemble size: ~10 for NN ensembles vs. ~250 for ensembles of other models
Complex problems need larger ensembles
MotherNets: Rapid Deep Ensemble Learning
It is all about data movement and computation
MotherNets
Input: a set of DNN architectures
i. Capture structural similarity (construct the MotherNet)
ii. Train once (the MotherNet is trained)
iii. Transfer learned function (each ensemble network starts with the same function as the MotherNet)
iv. Train incrementally
I. Capture structural similarity
Find the largest network structure common amongst all networks.
Neural network specifications (neurons per layer):
            L1    L2    L3    L4
NN1          5    20    13     -
NN2         15     8    20     -
NN3         10    18    10    22
MotherNet    5     8    10
Smaller than any of the ensemble networks
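A small sketch of this construction, assuming each network is specified simply as a list of fully connected layer widths: the MotherNet keeps the smallest number of layers and, at each layer position, the smallest width, so it is never larger than any ensemble member. Convolutional or block-structured networks need a richer notion of "common structure" than this sketch shows.

```python
def mothernet_structure(specs):
    """Largest structure common to all networks, assuming each spec is a list of
    layer widths, e.g. NN1=[5, 20, 13], NN2=[15, 8, 20], NN3=[10, 18, 10, 22]."""
    depth = min(len(s) for s in specs)                        # fewest layers among the networks
    return [min(s[i] for s in specs) for i in range(depth)]   # smallest width per layer position

specs = [[5, 20, 13], [15, 8, 20], [10, 18, 10, 22]]
print(mothernet_structure(specs))   # -> [5, 8, 10], matching the slide's MotherNet
```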
II. Transfer learned function
Function-preserving transformations: increase the capacity (expressivity) of the networks while preserving their function (and accuracy).
Widen the network / Deepen the network
Morph the MotherNet into the ensemble networks (a widening sketch follows below).
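For intuition, here is a Net2Net-style widening of a single fully connected layer, which preserves the layer's function while adding capacity. This is a simplified illustration (biases omitted, new units copy existing ones), not necessarily the exact transformation set used by MotherNets.

```python
import numpy as np

def widen_layer(W_in, W_out, new_width, rng=np.random.default_rng(0)):
    """Function-preserving widening of one hidden layer (Net2Net-style sketch).
    W_in:  (old_width, n_inputs)  weights into the layer
    W_out: (n_outputs, old_width) weights out of the layer
    New units replicate randomly chosen existing units (so their activations are
    identical), and the outgoing weights of each replicated unit are split among
    its copies so the layer's output is unchanged. Biases of copied units would
    simply be duplicated."""
    old_width = W_in.shape[0]
    extra = new_width - old_width
    picks = rng.integers(0, old_width, size=extra)        # which units to replicate
    W_in_new = np.vstack([W_in, W_in[picks]])              # copy incoming weights
    counts = np.bincount(picks, minlength=old_width) + 1   # copies per original unit
    W_out_new = np.hstack([W_out / counts, W_out[:, picks] / counts[picks]])
    return W_in_new, W_out_new
```

Because every copy of a unit produces the same activation and the outgoing weights are divided evenly among the copies, the widened layer computes exactly the same function before any further training; the new parameters are then perturbed slightly to create diversity.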
How does it behave?
Setup: 100 variants of VGG-16 (different structures), CIFAR-10
Training approaches: Full-data, Bagging, MotherNets
[Figure: Training time (hrs., 0 to 60) vs. number of neural networks (1 to 100): Full-data, Bagging, MotherNets]
Time to add one neural network:
                                      Full-Data   Bagging    MotherNets
CIFAR-10                              35 min.     27 min.    7 min.
CIFAR-100 (complex data)              67 min.     35 min.    14 min.
CIFAR-10 | ResNet (complex model)     516 min.    206 min.   155 min.
Accuracy vs Performance?
[Figure 3: (a) Ensemble test error rate (%) for full-data, bagging, KD, and MotherNets; (b) Training time (min) for FD, Bag., KD, and MNet across VGG variants (V13, V16, V16A, V16B, V19) and the MotherNet (MN); (c) Tradeoff: training time vs. test error rate for MotherNets and full-data; (d) Varying the number of clusters: error rate and training time.]
Figure 3: MotherNets achieves comparable individual and ensemble accuracy to the full-data approach but in a fraction of the training time, striking a better time-accuracy tradeoff.
networks in a function-preserving manner, we build a separate MotherNet for each class (or a set of MotherNets if each class also consists of networks of diverse sizes). This enables MotherNets to closely capture the architecture class of every neural network in the diverse ensemble.
2.4. Training
Step 1: Training the MotherNet. First, the MotherNet for every cluster is trained from scratch using the entire data set until convergence. This allows the MotherNet to learn a good core representation of the data. Since the MotherNet has fewer parameters than any of the ensemble networks in its cluster (by construction), it is expected to take less time to converge than any of the ensemble networks.
Step 2: Hatching Ensemble Networks. Once the MotherNet is trained, the next step is to generate every ensemble network by expanding the MotherNet through a sequence of function-preserving transformations. We call this process hatching. Hatching yields the ensemble networks with possibly larger size (more parameters) than the MotherNet but representing the same function as learned by the MotherNet. The hatching process is instantaneous, as generating every ensemble network requires a single pass on the MotherNet. Every hatched ensemble network with size (number of parameters) larger than the MotherNet now has additional untrained parameters that we need to further train. To explicitly add diversity to the hatched networks, we randomly perturb the newly introduced parameters. This is a standard technique to create diversity when training ensemble networks.
Step 3: Training Ensemble Networks. The hatched ensemble networks are trained using bootstrap aggregation (bagging) (Breiman, 1996). Bagging is an effective and widely used method to create diversity and to reduce the variance of ensembles (Hansen & Salamon, 1990; Guzman-Rivera et al., 2014; Lee et al., 2015b). This is because different models in the ensemble are trained using different overlapping subsets of the data. However, recent work
has shown that training neural network ensembles through bagging results in decreased generalization accuracy, as it reduces the number of unique data items seen by individual neural networks. Since neural networks have a large number of parameters, they are affected relatively more by this reduction in unique data items than ensembles of other classifiers such as decision trees or SVMs (Domingos, 1999; Lee et al., 2015a). Hatching the networks from the MotherNet and then further training them using bagging overcomes this problem.
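Putting the three steps together, a high-level sketch of the pipeline; train, hatch, perturb, and bootstrap are hypothetical callables standing in for the operations described above, so the structure, not the exact API, is the point.

```python
def train_ensemble_with_mothernets(mothernet, target_specs, dataset,
                                   train, hatch, perturb, bootstrap):
    """MotherNets training pipeline (sketch):
    Step 1: train the MotherNet on the full data set until convergence.
    Step 2: hatch each ensemble network via function-preserving transformations
            and slightly perturb the newly introduced parameters for diversity.
    Step 3: finish training every hatched network on a bootstrap sample (bagging)."""
    train(mothernet, dataset)                    # step 1: learn a core representation
    ensemble = []
    for spec in target_specs:
        net = hatch(mothernet, spec)             # step 2: same function, more capacity
        perturb(net)                             #         randomize the new parameters
        train(net, bootstrap(dataset))           # step 3: bagging over hatched networks
        ensemble.append(net)
    return ensemble
```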
We explain in supplementary material Section B how MotherNets' training process can be parallelized and how state-of-the-art general network training optimizations can be incorporated into MotherNets.
2.5. Optimizing for Inference
While our focus is on reducing training costs, the MotherNets approach can also help during inference. To optimize an ensemble trained through MotherNets for inference cost (time and memory requirements), we share the incremental training of MotherNet parameters. To achieve this, we construct a shared-MotherNet from the hatched networks and train the shared-MotherNet incrementally instead of the hatched networks. In the shared-MotherNet, the hatched networks are combined such that they have a single copy of the MotherNet parameters, which is jointly trained instead of being trained independently by each ensemble network. This yields an ensemble with fewer parameters that is more efficient to infer from and more compact to deploy. We explain this approach to optimizing ensemble inference, and provide experimental results, in supplementary material Section C.
3. Experiments
We show how MotherNets can be used to accelerate the training of diverse ensembles of neural networks.
Training Setup. We adopt stochastic gradient descent with
Accelerating inference in MotherNets
Inference cost
Training (MotherNets step iv, incremental training): maintain common MotherNets parameters
Inference: a full pass through the shared MotherNet parameters plus partial passes through each network's own parameters
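One way to picture the shared-MotherNet at inference time, assuming the shared MotherNet parameters form a common trunk evaluated once (the full pass) while each member only runs its own additional parameters (the partial passes), with outputs averaged at the end. This layout and combiner are simplifying assumptions for illustration, not the exact construction.

```python
import torch
import torch.nn as nn

class SharedMotherNetEnsemble(nn.Module):
    """Sketch: a single copy of the MotherNet parameters (shared trunk) plus
    per-member additional layers. Inference costs one full pass through the
    trunk and only a partial pass through each member's extra parameters."""
    def __init__(self, trunk: nn.Module, heads: list[nn.Module]):
        super().__init__()
        self.trunk = trunk                    # shared MotherNet parameters
        self.heads = nn.ModuleList(heads)     # each hatched network's own parameters

    def forward(self, x):
        shared = self.trunk(x)                           # full pass, computed once
        logits = [head(shared) for head in self.heads]   # partial pass per member
        return torch.stack(logits).mean(dim=0)           # average the member outputs
```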
Accelerating inference in MotherNets: initial results
Setup: CIFAR-10, ensemble of 5 VGGNets with between 13 and 34 layers
Inference time: Standard 3.71 min. vs. Share-MN 1.95 min.
Test error rate: 7.5% vs. 7.9%
finding the first principles of neural networks
Stratos Idreos