stratos idreos - daslabdaslab.seas.harvard.edu/classes/cs265/files/... · reduction in unique data...

Stratos Idreos


Post on 07-Jul-2020


Page 1: Stratos Idreos - DASlabdaslab.seas.harvard.edu/classes/cs265/files/... · reduction in unique data items than ensembles of other clas-sifiers such as decision trees or SVMs (Domingos,

Stratos Idreos

Talks at the lab this week:

11:30am today, Yifan Wu from Berkeley: Real-Time Interactive Data Analytic Interfaces

11:30am tomorrow, K. Karanasos from Microsoft: (big data/ML)

Discussion papers as of Monday

Review submission every class

deep learning

Academic interest

Market share

Wealth of applications


Some input Some output


Cat

Image recognition

Dog

Table

Machine translation

Γατα

Cat


What happens in Vegas…

Auto-complete


cat = f ( )

How do we do this mapping?


perceptron

weight

3.4

data

clean/transform

activation function
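The perceptron slide names its parts (data, weight, activation function) without showing how they combine. A minimal sketch, assuming the usual formulation in which the inputs are multiplied by weights, summed with a bias, and passed through an activation function. The function names, the bias, and all numbers except the weight 3.4 are hypothetical.

```python
import math

def perceptron(inputs, weights, bias, activation):
    """Weighted sum of the inputs followed by an activation function."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

def step(z):
    """Threshold activation: output 1 when the weighted sum is positive."""
    return 1 if z > 0 else 0

def sigmoid(z):
    """Smooth activation that squashes the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical numbers: 3.4 echoes the weight shown on the slide,
# everything else is made up for illustration.
data = [1.0, -2.0]
weights = [3.4, 0.5]
bias = -1.0

# The weighted sum here is 3.4 - 1.0 - 1.0 = 1.4.
fired = perceptron(data, weights, bias, step)
score = perceptron(data, weights, bias, sigmoid)
```

Swapping `step` for `sigmoid` changes only the last stage, which is why the activation function can be treated as a separate component.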

Multi-layer perceptron

“can represent a wide variety of interesting functions”

Some weird neural network!

Data to features

Features to labels

+

-

Deep neural networks


Cat

Multiple layers

Cat

Training phase: pass labeled data until we get to an acceptable accuracy

Inference: new data -> result


How do we train these networks?

Labeled data

Forward pass to compute a prediction

Fish

Backward pass to ‘slightly’ nudge the weights

repeat until happy/convergence
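The loop on this slide (forward pass, backward pass, repeat) can be sketched for a single sigmoid neuron. This is an illustration under stated assumptions, not the training code behind the slides: the cross-entropy loss, the learning rate, the stopping threshold, and the toy AND data set are all hypothetical choices.

```python
import math

def forward(x, w, b):
    """Forward pass: compute the prediction for one example."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, lr=0.5, max_epochs=5000, tol=0.05):
    """Repeat forward and backward passes until the loss is small enough."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(max_epochs):
        loss = 0.0
        for x, y in samples:
            p = forward(x, w, b)                      # forward pass
            loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            grad = p - y                              # gradient w.r.t. the weighted sum
            # backward pass: nudge each weight slightly against its gradient
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
        if loss / len(samples) < tol:                 # "happy": stop training
            break
    return w, b

# Toy labeled data (hypothetical): the logical AND of two inputs.
labeled = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = train(labeled)
predictions = [round(forward(x, w, b)) for x, _ in labeled]
```

A deep network repeats the same pattern with many layers, where the backward pass propagates the gradient through each layer in turn.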

Labeled data

PERFORMANCE (TRAINING/INFERENCE)

READING/WRITING DATA + COMPUTATION

Labeled data

ACCURACY

Layers in state-of-the-art NN

Year    Layers
2012    7
2014    19
2015    152
2017    264

Deep Residual Learning for Image Recognition [He et al., 2016]

Single Model

DNN: Cat

Ensemble Model

DNN: Cat

DNN: Cat

DNN: Catastrophe

CAT


Representationally richer

Wisdom of the crowd
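One simple way to read "wisdom of the crowd" is majority voting over the members' predictions. A minimal sketch, assuming plain majority voting, which is only one of several ways to combine ensemble outputs; the function name is hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most ensemble members."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label

# The three member predictions shown on the ensemble slide.
member_outputs = ["Cat", "Cat", "Catastrophe"]
ensemble_output = majority_vote(member_outputs)
```

With these three outputs the single dissenting member is outvoted, which is the effect the slide illustrates.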


finding the first principles of neural networks


VERY EXPENSIVE TO TRAIN

How expensive is it?

Dataset: CIFAR-10

100 variants of VGG-16 (different structures)

Training approaches: Full-data, Bagging

[Figure: training time (hrs., 0 to 60) vs. number of neural networks (1 to 100); series: Full-data, Bagging]

Time to add one neural network

35 min. 27 min.

Legend: Bagging, Full-Data

CIFAR-10: 35 min. 27 min.

CIFAR-100 (Complex data): 67 min. 35 min.

CIFAR-10 | ResNet (Complex model): 516 min. 206 min.

Legend: Bagging, Full-Data


~250

~10

NN ensembles

other models ensembles

complex problems need larger ensembles

ensemble size

MotherNets: Rapid Deep Ensemble Learning

it is all about data movement and computation

MotherNets

DNN architectures

MotherNets

i. Capture structural similarity (MotherNet)

ii. Train once (Trained)

iii. Transfer learned function (Same function as MotherNet)

iv. Train incrementally


I. Capture structural similarity

Find the largest network structure common amongst all networks

Neural Network Specifications

            L1   L2   L3   L4
NN1         20   13    5
NN2         15    8   20
NN3         10   18   10   22
MotherNet   10    8    5

Smaller than any of the ensemble networks
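The construction on this slide can be sketched as a per-layer minimum. This is a sketch under assumptions: it reads the slide's numbers as layer sizes for NN1 (20, 13, 5), NN2 (15, 8, 20), and NN3 (10, 18, 10, 22), and it takes "largest common structure" to mean the depth of the shallowest network with the smallest size found at each layer position. The function name is hypothetical.

```python
def build_mothernet(ensemble_specs):
    """Largest layer structure shared by every network in the ensemble.

    Each spec lists one size per layer. The result has as many layers as
    the shallowest network, and each layer is as large as the smallest
    layer found at that position.
    """
    depth = min(len(spec) for spec in ensemble_specs)
    return [min(spec[i] for spec in ensemble_specs) for i in range(depth)]

# Specifications as read from the slide (layers L1..L4); this reading of
# the garbled slide text is an assumption.
nn1 = [20, 13, 5]
nn2 = [15, 8, 20]
nn3 = [10, 18, 10, 22]

mothernet = build_mothernet([nn1, nn2, nn3])
total_size = sum(mothernet)
```

Under this reading the result matches the MotherNet row on the slide (10, 8, 5), and its total size is below that of every ensemble member.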


II. Transfer learned function

Function Preserving Transformations

Increase the capacity (expressivity) of the networks while preserving their function (also accuracy)

Deepen the network

Widen the network
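The slide does not say how a network is widened without changing its function. One well-known way, assumed here for illustration, is to duplicate a hidden unit and split its outgoing weights between the original and the copy. The tiny bias-free ReLU network, the helper names, and the weights below are all hypothetical.

```python
def relu(v):
    """Element-wise ReLU."""
    return [max(0.0, x) for x in v]

def matvec(rows, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in rows]

def predict(w_hidden, w_out, x):
    """Tiny network: input -> hidden layer with ReLU -> linear output."""
    return matvec(w_out, relu(matvec(w_hidden, x)))

def widen(w_hidden, w_out, unit):
    """Add one hidden unit by duplicating `unit` without changing the output.

    The new unit copies the incoming weights of `unit`; the outgoing
    weights of the original and the copy are each halved so that their
    combined contribution equals the original one.
    """
    new_hidden = [row[:] for row in w_hidden] + [w_hidden[unit][:]]
    new_out = []
    for row in w_out:
        half = row[unit] / 2.0
        new_out.append(row[:unit] + [half] + row[unit + 1:] + [half])
    return new_hidden, new_out

# Hypothetical weights: 2 inputs, 2 hidden units, 1 output.
w_hidden = [[0.5, -1.0], [1.5, 2.0]]
w_out = [[2.0, -3.0]]
x = [1.0, 0.5]

before = predict(w_hidden, w_out, x)
wide_hidden, wide_out = widen(w_hidden, w_out, unit=1)
after = predict(wide_hidden, wide_out, x)
```

The widened network has three hidden units but returns the same output for the same input, so training can continue from the learned function rather than from scratch.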


Morph MotherNets to ensemble networks


Capture structural similarity Train once Transfer learned function Train incrementally

i ii iii iv

full data bagging

How does it behave?

Dataset: CIFAR-10

100 variants of VGG-16 (different structures)

Training approaches: Full-data, Bagging

[Figure: training time (hrs., 0 to 60) vs. number of neural networks (1 to 100); series: Full-data, Bagging, MotherNets]


Time to add one neural network

7 min. 35 min. 27 min.

Legend: Bagging, Full-Data, MotherNets


Time to add one neural network

CIFAR-10: 7 min. 35 min. 27 min.

CIFAR-100 (Complex data): 14 min. 67 min. 35 min.

CIFAR-10 | ResNet (Complex model): 155 min. 516 min. 206 min.

Legend: Bagging, Full-Data, MotherNets


Accuracy vs Performance?


[Figure 3 panels: (a) Ensemble test error rate (%) under Vote, EA, SL, and O for bagging, full-data, KD, and MotherNets; (b) Training time (min) for FD, Bag., KD, and MNet, with legend MN, V13, V16, V16A, V16B, V19; (c) Tradeoff: training time (min) vs. test error rate (%) for MotherNets and full-data; (d) Varying the number of clusters: error rate (%) and time (min) for 1 to 5 clusters]

Figure 3: MotherNets achieves comparable individual and ensemble accuracy to the full-data approach but in a fraction of the training time – striking a better time-accuracy tradeoff.

networks in a function-preserving manner, we build a separate MotherNet for each class (or a set of MotherNets if each class also consists of networks of diverse sizes). This enables MotherNets to closely capture the architecture class of every neural network in the diverse ensemble.

2.4. Training

Step 1: Training the MotherNet. First, the MotherNet for every cluster is trained from scratch using the entire data set until convergence. This allows the MotherNet to learn a good core representation of the data. Since the MotherNet has fewer parameters than any of the ensemble networks in its cluster (by construction) it is expected to take less time to converge than any of the ensemble networks.

Step 2: Hatching Ensemble Networks. Once the MotherNet is trained, the next step is to generate every ensemble network by expanding the MotherNet through a sequence of function-preserving transformations. We call this process hatching. Hatching yields the ensemble networks with possibly larger size (more parameters) than the MotherNet but representing the same function as learned by the MotherNet. The hatching process is instantaneous as generating every ensemble network requires a single pass on the MotherNet. Every hatched ensemble network with size (number of parameters) larger than the MotherNet now has additional untrained parameters that we need to further train. To explicitly add diversity to the hatched networks, we randomly perturb the newly introduced parameters. This is a standard technique to create diversity when training ensemble networks.

Step 3: Training Ensemble Networks. The hatched ensemble networks are trained using bootstrap aggregation (bagging) (Breiman, 1996). Bagging is an effective and widely used method to create diversity and to reduce the variance of ensembles (Hansen & Salamon, 1990; Guzman-Rivera et al., 2014; Lee et al., 2015b). This is because different models in the ensemble are trained using different overlapping subsets of the data. However, recent work has shown that training neural network ensembles through bagging results in decreased generalization accuracy as it reduces the number of unique data items seen by individual neural networks. Since neural networks have a large number of parameters, they are affected relatively more by this reduction in unique data items than ensembles of other classifiers such as decision trees or SVMs (Domingos, 1999; Lee et al., 2015a). Hatching the networks from the MotherNet and then further training them using bagging overcomes this problem.
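The reduction in unique data items can be seen with a short simulation. A minimal sketch, assuming the common bagging setup in which each bootstrap sample draws as many items as the data set contains, uniformly and with replacement; the excerpt does not state the sample size used, and the data set size and seed below are hypothetical.

```python
import random

def bootstrap_sample(n_items, rng):
    """Draw n_items indices uniformly at random with replacement."""
    return [rng.randrange(n_items) for _ in range(n_items)]

def unique_fraction(sample, n_items):
    """Fraction of the original data set that appears in the sample."""
    return len(set(sample)) / n_items

rng = random.Random(0)      # fixed seed so the run is repeatable
n = 50_000                  # hypothetical data set size
sample = bootstrap_sample(n, rng)
fraction = unique_fraction(sample, n)
```

Under this assumption the expected fraction of distinct items is 1 - (1 - 1/n)^n, which approaches 1 - 1/e (about 0.632) for large n, so each network trained this way sees roughly a third fewer unique items than with the full data.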

We explain in supplementary material Section B how MotherNets’ training process can be parallelized and how state-of-the-art general network training optimizations can be incorporated into MotherNets.

2.5. Optimizing for Inference

While our focus is on reducing training costs, the MotherNets approach can also help during inference. To optimize an ensemble trained through MotherNets for inference cost (time and memory requirement), we share the incremental training of MotherNet parameters. To achieve this, we construct a shared-MotherNet from the hatched networks and train the shared-MotherNet incrementally instead of the hatched networks. In shared-MotherNet, the hatched networks are combined together such that they have a single copy of MotherNet parameters, which is jointly trained instead of being trained independently by each ensemble network. This yields an ensemble with fewer parameters that is more efficient to infer from and is more compact to deploy. We explain this approach to optimize ensemble inference as well as provide experimental results in supplementary material Section C.

3. Experiments

We show how MotherNets can be used to accelerate the training of diverse ensembles of neural networks.

Training Setup. We adopt stochastic gradient descent with


Accelerating inference in MotherNets

Inference cost

Training: Maintain common MotherNets parameters

Inference: Full-pass, Partial passes

MN

IV. Incremental Training

Accelerating inference in MotherNets

Initial Results

CIFAR-10

Ensemble of 5 VGGNets

Layers between 13 and 34

Accelerating inference in MotherNets

Initial results

Inference time: 1.95 min. 3.71 min.

Test error rate: 7.9% 7.5%

Legend: Standard, Share-MN


finding the first principles of neural networks


Stratos Idreos